How can I get the first registry with a type A and the first registry with a type B and move it to a single row
up vote
1
down vote
favorite
I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.
Will be easy for me to demonstrate with this sample case
Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04
I wanted the result of the processing to be
Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04
I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.
python pandas bigdata databricks
add a comment |
up vote
1
down vote
favorite
I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.
Will be easy for me to demonstrate with this sample case
Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04
I wanted the result of the processing to be
Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04
I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.
python pandas bigdata databricks
What if theC
row is before theA
row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
Sorry, so how is possible distinguish whichId
have typeA,C
and whichB,D
? Maybe help add anotherId
with output?
– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.
Will be easy for me to demonstrate with this sample case
Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04
I wanted the result of the processing to be
Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04
I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.
python pandas bigdata databricks
I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.
Will be easy for me to demonstrate with this sample case
Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04
I wanted the result of the processing to be
Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04
I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.
python pandas bigdata databricks
python pandas bigdata databricks
edited Nov 10 at 13:12
Community♦
11
11
asked Nov 10 at 12:00
Gabriel Monteiro Nepomuceno
1,0911425
1,0911425
What if theC
row is before theA
row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
Sorry, so how is possible distinguish whichId
have typeA,C
and whichB,D
? Maybe help add anotherId
with output?
– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
add a comment |
What if theC
row is before theA
row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
Sorry, so how is possible distinguish whichId
have typeA,C
and whichB,D
? Maybe help add anotherId
with output?
– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
What if the
C
row is before the A
row? Do you want this per id?– Willem Van Onsem
Nov 10 at 12:01
What if the
C
row is before the A
row? Do you want this per id?– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
1
Sorry, so how is possible distinguish which
Id
have type A,C
and which B,D
? Maybe help add another Id
with output?– jezrael
Nov 10 at 12:11
Sorry, so how is possible distinguish which
Id
have type A,C
and which B,D
? Maybe help add another Id
with output?– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14
add a comment |
1 Answer
1
active
oldest
votes
up vote
1
down vote
accepted
First filter by isin
and boolean indexing
, remove duplicates by drop_duplicates
per column Id
, set to index and rename
columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))
Last concat
together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
First filter by isin
and boolean indexing
, remove duplicates by drop_duplicates
per column Id
, set to index and rename
columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))
Last concat
together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
up vote
1
down vote
accepted
First filter by isin
and boolean indexing
, remove duplicates by drop_duplicates
per column Id
, set to index and rename
columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))
Last concat
together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
First filter by isin
and boolean indexing
, remove duplicates by drop_duplicates
per column Id
, set to index and rename
columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))
Last concat
together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
First filter by isin
and boolean indexing
, remove duplicates by drop_duplicates
per column Id
, set to index and rename
columns names:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])
df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))
Last concat
together:
df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04
edited Nov 10 at 12:50
answered Nov 10 at 12:26
jezrael
313k21250328
313k21250328
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47
1
1
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53238724%2fhow-can-i-get-the-first-registry-with-a-type-a-and-the-first-registry-with-a-typ%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
What if the
C
row is before theA
row? Do you want this per id?– Willem Van Onsem
Nov 10 at 12:01
The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05
I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07
1
Sorry, so how is possible distinguish which
Id
have typeA,C
and whichB,D
? Maybe help add anotherId
with output?– jezrael
Nov 10 at 12:11
It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14