How can I get the first registry with a type A and the first registry with a type B and move it to a single row

up vote
1
down vote

favorite

I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.

Will be easy for me to demonstrate with this sample case

Id,type,Datetime 1, A, 2018-01-01 1, I, 2018-01-02 1, I, 2018-01-03 1, C, 2018-01-04 2, I, 2018-01-01 2, B, 2018-01-02 2, J, 2018-01-03 2, K, 2018-01-04 2, D, 2018-01-05 2, C, 2018-01-06 3, A, 2018-01-03 3, D, 2018-01-09 4, A, 2018-01-03 4, D, 2018-01-05 4, C, 2018-01-04

I wanted the result of the processing to be

Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate 1, A, 2018-01-01, C, 2018-01-04 2, B, 2018-01-01, D, 2018-01-05 3, A, 2018-01-03, D, 2018-01-09 4, A, 2018-01-03, C, 2018-01-04

I am currently trying to do this using python and pandas in a Databricks cluster but I am opening to Other ideas I need to to that in files that come up to a total of few TB so it is a large dataset so speed must be in mind and I can't store the whole dataset in memory.

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0911425

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

1

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

add a comment |

up vote
1
down vote

favorite

I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.

Will be easy for me to demonstrate with this sample case

I wanted the result of the processing to be

Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate 1, A, 2018-01-01, C, 2018-01-04 2, B, 2018-01-01, D, 2018-01-05 3, A, 2018-01-03, D, 2018-01-09 4, A, 2018-01-03, C, 2018-01-04

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0911425

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

1

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

add a comment |

up vote
1
down vote

favorite

I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.

Will be easy for me to demonstrate with this sample case

I wanted the result of the processing to be

Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate 1, A, 2018-01-01, C, 2018-01-04 2, B, 2018-01-01, D, 2018-01-05 3, A, 2018-01-03, D, 2018-01-09 4, A, 2018-01-03, C, 2018-01-04

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0911425

I am trying to use python and pandas to get the first registry with the type A or B and the first registry with the type C or D and move that to a single row. I want this per Id.

Will be easy for me to demonstrate with this sample case

I wanted the result of the processing to be

Id, ProcessEvent, ProcessDate, DispatchedEvent, DispatchedDate 1, A, 2018-01-01, C, 2018-01-04 2, B, 2018-01-01, D, 2018-01-05 3, A, 2018-01-03, D, 2018-01-09 4, A, 2018-01-03, C, 2018-01-04

python pandas bigdata databricks

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0911425

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0911425

edited Nov 10 at 13:12

Community♦

edited Nov 10 at 13:12

Community♦

edited Nov 10 at 13:12

Community♦

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0911425

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0911425

asked Nov 10 at 12:00

Gabriel Monteiro Nepomuceno

1,0911425

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

1

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

add a comment |

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

1

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

What if the C row is before the A row? Do you want this per id?
– Willem Van Onsem
Nov 10 at 12:01

The way this data it is structured there is no chance of the c row be before the A row but if it is I would still get the time of A or B and the time of C. Doesn't matter if C is before A or B
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:05

I want this per Id.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:07

Sorry, so how is possible distinguish which Id have type A,C and which B,D ? Maybe help add another Id with output?
– jezrael
Nov 10 at 12:11

It doesn't matter if the Id has A,C or B,D or A,D or B,C I want the first A or B and the first C or D. Added another example Id 3
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:14

add a comment |

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
 Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

313k21250328

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53238724%2fhow-can-i-get-the-first-registry-with-a-type-a-and-the-first-registry-with-a-typ%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
1
down vote

accepted

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
 Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

313k21250328

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

up vote
1
down vote

accepted

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
 Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

313k21250328

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

up vote
1
down vote

accepted

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
 Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

313k21250328

First filter by isin and boolean indexing, remove duplicates by drop_duplicates per column Id, set to index and rename columns names:

df['Datetime'] = pd.to_datetime(df['Datetime'])
df = df.sort_values(['Id','Datetime'])

df1 = (df[df['type'].isin(['A','B'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'ProcessEvent','Datetime':'ProcessDate'))
df2 = (df[df['type'].isin(['C','D'])]
 .drop_duplicates('Id')
 .set_index('Id')
 .rename(columns='type':'DispatchedEvent','Datetime':'DispatchedDate'))

Last concat together:

df = pd.concat([df1, df2], axis=1).reset_index()
print (df)
 Id ProcessEvent ProcessDate DispatchedEvent DispatchedDate
0 1 A 2018-01-01 C 2018-01-04
1 2 B 2018-01-02 D 2018-01-05
2 3 A 2018-01-03 D 2018-01-09
3 4 A 2018-01-03 C 2018-01-04

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

313k21250328

edited Nov 10 at 12:50

answered Nov 10 at 12:26

jezrael

313k21250328

answered Nov 10 at 12:26

jezrael

313k21250328

answered Nov 10 at 12:26

jezrael

313k21250328

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

1

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

Does this works with the case of C being before D in the file but with the earlier Date? For example ``` Id,type,Datetime 4, A, 2018-01-01 4, C, 2018-01-03 4, D, 2018-01-02 ``` would that generate 4,A,2018-01-0,D,2018-01-02 in case not I can just add one order by date before.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:34

@GabrielMonteiroNepomuceno - Can you change sample data? Maybe sorting is necessary first.
– jezrael
Nov 10 at 12:39

Corrected the sample data.
– Gabriel Monteiro Nepomuceno
Nov 10 at 12:47

@GabrielMonteiroNepomuceno - First convert to datetime and sorting by columns Id and Datetime, check edited answer.
– jezrael
Nov 10 at 12:51

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Pfthb