How to check if pandas DataFrame groups have the same data
I have a pandas DataFrame as below:

    id  name  Base  field1  field2          field3
    1   AA    Y     Yes     Consumer        Not Applicable
    1   BB    N     Yes     Consumer        Not Applicable
    2   CC    Y     Yes     Consumer        Not Applicable
    2   DD    N     Yes     Not Applicable  Not Applicable
    2   EE    N     No      Not Applicable  Modified
    3   FF    Y     Yes     Not Applicable  Applicable
    3   GG    N     Yes     Not Applicable  Not Applicable
    3   HH    N     Yes     Not Applicable  Not Applicable
I want to group this DataFrame by the id column, check whether every other column holds the same value within each group, and finally write out the results.
I tried the following to validate the data in each group, but it always returns True:
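    import pandas as pd

    result_list = []
    for col in df.columns:
        # True when the column holds a single distinct value within the group
        result = df.groupby(level=0)[col].apply(lambda x: len(set(x)) == 1)
        result_list.append(result)
    final = pd.concat(result_list, axis=1)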
The expected result is:

    id  name  field1  field2          field3          Error
    1   AA    Yes     Consumer        Not Applicable  Pass
    1   BB    Yes     Consumer        Not Applicable  Pass
    2   CC    Yes     Consumer        Not Applicable  field1, field2, field3 mismatch for ID: 2
    2   DD    Yes     Not Applicable  Not Applicable  field1, field2, field3 mismatch for ID: 2
    2   EE    No      Not Applicable  Modified        field1, field2, field3 mismatch for ID: 2
    3   FF    Yes     Not Applicable  Applicable      field3 mismatch for ID: 3
    3   GG    Yes     Not Applicable  Not Applicable  field3 mismatch for ID: 3
    3   HH    Yes     Not Applicable  Not Applicable  field3 mismatch for ID: 3
Any help on this?
python-3.x pandas dataframe pandas-groupby
What's your desired result? Is it that only id = 1 passes your test?
– jpp
Nov 13 '18 at 13:38
Hi, I've updated the DataFrame and the expected result. Let me know if it helps.
– Osceria
Nov 13 '18 at 14:02
edited Nov 14 '18 at 10:23 by Akhilesh Pandey
asked Nov 13 '18 at 12:52 by Osceria
2 Answers
You may get what you want with the following code (assuming that df has an index named id):
    def handler(df):
        # Report the first column that holds more than one distinct value
        for col in ['field1', 'field2', 'field3']:
            if df.loc[:, col].nunique() > 1:
                return 'error in {} for id {}'.format(col, df.index[0])
        else:
            # The loop finished without returning: every column is consistent
            return 'pass'

    result = df.groupby(level=0).apply(handler)
    result = df.reset_index().merge(result.to_frame().reset_index(), on='id')

result is:
       id  name  field1  field2          field3          0
    0  1   AA    Yes     Consumer        Not Applicable  pass
    1  1   BB    Yes     Consumer        Not Applicable  pass
    2  2   CC    Yes     Consumer        Not Applicable  error in field1 for id 2
    3  2   DD    Yes     Not Applicable  Not Applicable  error in field1 for id 2
    4  2   EE    No      Not Applicable  Modified        error in field1 for id 2
    5  3   FF    Yes     Not Applicable  Applicable      error in field3 for id 3
    6  3   GG    Yes     Not Applicable  Not Applicable  error in field3 for id 3
    7  3   HH    Yes     Not Applicable  Not Applicable  error in field3 for id 3
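The assumption about the index matters. With the default RangeIndex, groupby(level=0) groups on the row position, so every group holds exactly one row and any uniqueness check is trivially true; this is probably why the attempt in the question always returned True. A minimal sketch of the difference, using a hypothetical four-row subset of the question's data:

```python
import pandas as pd

# Hypothetical sample: a four-row subset of the frame in the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "name": ["AA", "BB", "CC", "DD"],
    "field2": ["Consumer", "Consumer", "Consumer", "Not Applicable"],
})

# Default RangeIndex: level 0 is 0, 1, 2, 3, so each group is a single row.
by_position = df.groupby(level=0)["field2"].nunique()

# With id as the index, level 0 is the id and rows are grouped as intended.
by_id = df.set_index("id").groupby(level=0)["field2"].nunique()

print(by_position.tolist())  # [1, 1, 1, 1]
print(by_id.to_dict())       # {1: 1, 2: 2}
```

Calling df.groupby("id") on the column directly should give the same grouping without changing the index.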
EDIT: a minor revision of handler so that it reports every mismatched column:
    def handler(df):
        # Collect every column that holds more than one distinct value
        cols = list()
        for col in ['field1', 'field2', 'field3']:
            if df.loc[:, col].nunique() > 1:
                cols.append(col)
        if cols:
            return 'error in {} for id {}'.format(', '.join(cols), df.index[0])
        else:
            return 'pass'
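For reference, here is one self-contained way to produce the Error column in the exact wording the question asks for. It is only a sketch: the frame is rebuilt by hand from the question's data (the Base column is left out), check_cols and messages are names made up for this example, and it groups on the id column rather than on the index.

```python
import pandas as pd

# Sample data copied from the question (Base column omitted).
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 3, 3, 3],
    "name": ["AA", "BB", "CC", "DD", "EE", "FF", "GG", "HH"],
    "field1": ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
    "field2": ["Consumer", "Consumer", "Consumer", "Not Applicable",
               "Not Applicable", "Not Applicable", "Not Applicable",
               "Not Applicable"],
    "field3": ["Not Applicable", "Not Applicable", "Not Applicable",
               "Not Applicable", "Modified", "Applicable", "Not Applicable",
               "Not Applicable"],
})

# Columns to validate (hypothetical name; adjust per frame).
check_cols = ["field1", "field2", "field3"]

# Build one message per id, listing every column with more than one value.
messages = {}
for group_id, group in df.groupby("id"):
    bad = [col for col in check_cols if group[col].nunique() > 1]
    if bad:
        messages[group_id] = "{} mismatch for ID: {}".format(", ".join(bad), group_id)
    else:
        messages[group_id] = "Pass"

# Broadcast the per-id message back onto every row of that id.
df["Error"] = df["id"].map(messages)
print(df)
```

Mapping a per-id dictionary back onto the rows avoids the merge step and leaves the original row order untouched.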
Hi Poolka, this code almost produces the expected result. But the error column doesn't show when the data mismatches on more than one field. For id 2 it should say: Field1, Field2 & Field3 mismatch for ID: 2. Any thoughts?
– Osceria
Nov 13 '18 at 16:28
The data comparison always happens on the first field of the list and other fields are skipped.
– Osceria
Nov 14 '18 at 0:52
@Osceria The code in the answer is the basis that works and performs something pretty close to what you want. Feel free to modify it (column names, handler, and so on) to meet your expectations. About the issue in the comment - check the EDIT addition.
– Poolka
Nov 14 '18 at 6:25
It works perfectly for my requirements after making slight changes. I just posted another post with a similar question but with an extra check. stackoverflow.com/questions/53295685/…
– Osceria
Nov 14 '18 at 8:39
You could groupby id and then agg each column, counting the number of unique values per group; there is a mismatch wherever that number is greater than 1:

    df[df.columns.drop('name')].groupby('id').agg(lambda x: len(x.unique())) > 1

This gives the following output, from which you could construct your string:
        field1  field2  field3
    id
    1   False   False   False
    2   True    True    True
    3   False   False   True
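One possible way to turn that boolean table into the error string, without hardcoding the field names, is sketched below. The five-row frame is a subset of the question's data, and check_cols and to_message are hypothetical names used only in this example.

```python
import pandas as pd

# Hypothetical sample: the id 1 and id 2 rows from the question.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "name": ["AA", "BB", "CC", "DD", "EE"],
    "field1": ["Yes", "Yes", "Yes", "Yes", "No"],
    "field2": ["Consumer", "Consumer", "Consumer", "Not Applicable",
               "Not Applicable"],
    "field3": ["Not Applicable", "Not Applicable", "Not Applicable",
               "Not Applicable", "Modified"],
})

# Validate every column except the key and the row label.
check_cols = list(df.columns.drop(["id", "name"]))

# One row per id, True where a column has more than one distinct value.
mismatch = df.groupby("id")[check_cols].nunique() > 1

def to_message(row):
    """Join the names of the True columns; row.name is the group's id."""
    if not row.any():
        return "Pass"
    return "{} mismatch for ID: {}".format(", ".join(row[row].index), row.name)

errors = mismatch.apply(to_message, axis=1)

# Look up each row's message by its id.
df["Error"] = df["id"].map(errors)
print(df)
```

Because check_cols is derived from df.columns, the same lines can be reused for frames whose field names differ, as long as the key and label columns keep their names.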
This helps. What if the column names to be validated differ between iterations? I run this piece in a for loop that iterates over different DataFrames (df1, df2, df3) whose columns differ, so I don't want to hardcode field names that keep changing from one DataFrame to the next.
– Osceria
Nov 13 '18 at 15:54
See the edit: you can pass the column list with the 'name' column dropped, and then use any number of other fields.
– Franco Piccolo
Nov 13 '18 at 17:52
Okay. Suppose I add another column (Base) to the DataFrame (edited). For every group based on id, exactly one row will have Base = 'Y' and the other rows in the group will have 'N'. The row where Base = 'Y' should be the reference, and the rows with Base = 'N' should be validated against it. The columns that differ on each row should be noted in an error column. Any thoughts?
– Osceria
Nov 14 '18 at 0:49
That completely changes the scope of the question and the solution. I would suggest posting another question with a different input and output for clarification.
– Franco Piccolo
Nov 14 '18 at 6:53
Okay. Posted here stackoverflow.com/questions/53295685/…
– Osceria
Nov 14 '18 at 8:40
edited Nov 14 '18 at 6:19
answered Nov 13 '18 at 14:52 by Poolka
edited Nov 13 '18 at 17:52
answered Nov 13 '18 at 14:53 by Franco Piccolo