How can I get the first registry with a type A and the first registry with a type B and move it to a single row









up vote
1
down vote

favorite












I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.



Will be easy for me to demonstrate with this sample case




Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04



I wanted the result of the processing to be




Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04



I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.










share|improve this question























  • What if the C row is before the A row? Do you want this per id?
    – Willem Van Onsem
    Nov 10 at 12:01











  • The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:05










  • I want this per Id.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:07






  • 1




    Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
    – jezrael
    Nov 10 at 12:11










  • It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:14















up vote
1
down vote

favorite












I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.



Will be easy for me to demonstrate with this sample case




Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04



I wanted the result of the processing to be




Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04



I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.










share|improve this question























  • What if the C row is before the A row? Do you want this per id?
    – Willem Van Onsem
    Nov 10 at 12:01











  • The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:05










  • I want this per Id.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:07






  • 1




    Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
    – jezrael
    Nov 10 at 12:11










  • It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:14













up vote
1
down vote

favorite









up vote
1
down vote

favorite











I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.



Will be easy for me to demonstrate with this sample case




Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04



I wanted the result of the processing to be




Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04



I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.










share|improve this question















I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.



Will be easy for me to demonstrate with this sample case




Id,type,Datetime
1, A, 2018-01-01
1, I, 2018-01-02
1, I, 2018-01-03
1, C, 2018-01-04
2, I, 2018-01-01
2, B, 2018-01-02
2, J, 2018-01-03
2, K, 2018-01-04
2, D, 2018-01-05
2, C, 2018-01-06
3, A, 2018-01-03
3, D, 2018-01-09
4, A, 2018-01-03
4, D, 2018-01-05
4, C, 2018-01-04



I wanted the result of the processing to be




Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate
1, A, 2018-01-01, C, 2018-01-04
2, B, 2018-01-01, D, 2018-01-05
3, A, 2018-01-03, D, 2018-01-09
4, A, 2018-01-03, C, 2018-01-04



I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.







python pandas bigdata databricks






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 10 at 13:12









Community

11




11










asked Nov 10 at 12:00









Gabriel Monteiro Nepomuceno

1,0911425




1,0911425











  • What if the C row is before the A row? Do you want this per id?
    – Willem Van Onsem
    Nov 10 at 12:01











  • The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:05










  • I want this per Id.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:07






  • 1




    Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
    – jezrael
    Nov 10 at 12:11










  • It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:14

















  • What if the C row is before the A row? Do you want this per id?
    – Willem Van Onsem
    Nov 10 at 12:01











  • The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:05










  • I want this per Id.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:07






  • 1




    Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
    – jezrael
    Nov 10 at 12:11










  • It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:14
















What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01





What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01













The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05




The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05












I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07




I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07




1




1




Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11




Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11












It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14





It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14













1 Answer
1






active

oldest

votes

















up vote
1
down vote



accepted










First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04





share|improve this answer






















  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34











  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53238724%2fhow-can-i-get-the-first-registry-with-a-type-a-and-the-first-registry-with-a-typ%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04





share|improve this answer






















  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34











  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51














up vote
1
down vote



accepted










First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04





share|improve this answer






















  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34











  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51












up vote
1
down vote



accepted







up vote
1
down vote



accepted






First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04





share|improve this answer














First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:



df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
.drop_duplicates('Id')
.set_index('Id')
.rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))


Last concat together:



df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 10 at 12:50

























answered Nov 10 at 12:26









jezrael

313k21250328




313k21250328











  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34











  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51
















  • Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:34











  • @GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
    – jezrael
    Nov 10 at 12:39










  • Corrected the sample data.
    – Gabriel Monteiro Nepomuceno
    Nov 10 at 12:47






  • 1




    @GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
    – jezrael
    Nov 10 at 12:51















Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34





Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34













@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39




@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39












Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47




Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47




1




1




@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51




@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53238724%2fhow-can-i-get-the-first-registry-with-a-type-a-and-the-first-registry-with-a-typ%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

How to how show current date and time by default on contact form 7 in WordPress without taking input from user in datetimepicker

Syphilis

Darth Vader #20