How to correlate categorical column in pandas?
I have a DataFrame df with a non-numerical column CatColumn.
          A         B    CatColumn
0  381.1396  7.343921       Medium
1  481.3268  6.786945       Medium
2  263.3766  7.628746         High
3  177.2400  5.225647  Medium-High
I want to include CatColumn in the correlation analysis with the other columns in the DataFrame. I tried DataFrame.corr, but it does not include columns with nominal values in the correlation analysis.
python pandas scikit-learn correlation categorical-data
asked Dec 19 '17 at 20:02
yousraHazem
3 Answers
I am going to strongly disagree with the other comments.
They miss the main point of correlation: how much does variable 1 increase or decrease as variable 2 increases or decreases? So, first of all, the order of the ordinal variable must be preserved during factorization/encoding. If you alter the order of the categories, the correlation changes completely. If you are building a tree-based model, this is a non-issue, but for a correlation analysis, special attention must be paid to preserving the order of an ordinal variable.
Let me make my argument reproducible. A and B are numeric, C is ordinal categorical in the following table, which is intentionally slightly altered from the one in the question.
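from io import StringIO
import pandas as pd

rawText = StringIO("""
A B C
0 100.1396 1.343921 Medium
1 105.3268 1.786945 Medium
2 200.3766 9.628746 High
3 150.2400 4.225647 Medium-High
""")
# Columns are separated by runs of whitespace, hence the regex separator
myData = pd.read_csv(rawText, sep=r"\s+")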
Notice: As C moves from Medium to Medium-High to High, both A and B increase monotonically. Hence we should see strong positive correlations for the pairs (C, A) and (C, B). Let's reproduce the two proposed answers:
In[226]: myData.assign(C=myData.C.astype('category').cat.codes).corr()
Out[226]:
A B C
A 1.000000 0.986493 -0.438466
B 0.986493 1.000000 -0.579650
C -0.438466 -0.579650 1.000000
Wait... What? Negative correlations? How come? Something is definitely not right. So what is going on?
What is going on is that C is factorized according to the alphanumerical sorting of its values. [High, Medium, Medium-High] are assigned [0, 1, 2], therefore the ordering is altered: 0 < 1 < 2 implies High < Medium < Medium-High, which is not true. Hence we accidentally calculated the response of A and B as C goes from High to Medium to Medium-High. The correct answer must preserve ordering, and assign [2, 0, 1] to [High, Medium, Medium-High]. Here is how:
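In[227]: myData['C'] = myData['C'].astype('category')
         # Categories sort as [High, Medium, Medium-High]; relabel them as [2, 0, 1].
         # (Assigning to .cat.categories directly was removed in pandas 2.0.)
         myData['C'] = myData['C'].cat.rename_categories([2, 0, 1])
         myData['C'] = myData['C'].astype('float')
         myData.corr()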
Out[227]:
A B C
A 1.000000 0.986493 0.998874
B 0.986493 1.000000 0.982982
C 0.998874 0.982982 1.000000
Much better!
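The same order-preserving encoding can also be written with an ordered pandas Categorical, which avoids relying on the alphabetical default. This is a minimal sketch rather than part of the original answer: the names `order` and `C_code` are my own, and it assumes the intended ranking is Medium < Medium-High < High.

```python
from io import StringIO
import pandas as pd

raw_text = StringIO("""A B C
100.1396 1.343921 Medium
105.3268 1.786945 Medium
200.3766 9.628746 High
150.2400 4.225647 Medium-High
""")
my_data = pd.read_csv(raw_text, sep=r"\s+")

# Assumed ranking, lowest to highest
order = ["Medium", "Medium-High", "High"]
ordered = pd.Categorical(my_data["C"], categories=order, ordered=True)

# Codes follow the position in `order`: Medium -> 0, Medium-High -> 1, High -> 2
my_data["C_code"] = ordered.codes

corr = my_data[["A", "B", "C_code"]].corr()
print(corr)
```

Spelling out the category order keeps the encoding correct even if new labels are added that would sort differently.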
Note1: If you want to treat your variable as a nominal variable, you can look at things like contingency tables, Cramer's V and the like; or group the continuous variable by the nominal categories etc. I don't think it would be right, though.
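For the nominal treatment mentioned in Note1, one option is to group the numeric column by category and summarise how much of its variance the grouping explains. The sketch below computes the correlation ratio (eta squared) on the question's data. This is a technique I am swapping in for illustration, not something the answer prescribes, and the helper name `eta_squared` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

def eta_squared(values: pd.Series, groups: pd.Series) -> float:
    """Share of the variance in `values` explained by `groups` (0 to 1)."""
    grand_mean = values.mean()
    # transform("mean") gives each row the mean of its own group
    group_means = values.groupby(groups).transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)

# Per-category means of A, then the overall strength of association
print(df.groupby("CatColumn")["A"].mean())
print(eta_squared(df["A"], df["CatColumn"]))
```

Unlike a correlation coefficient, this measure has no sign, because a nominal variable has no direction.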
Note2: If you had another category called Low, my answer could be criticized because I assigned equally spaced numbers to unequally spaced categories. You could argue that one should assign [2, 1, 1.5, 0] to [High, Medium, Medium-High, Low], which would be valid. I believe this is what people call the art part of data science.
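To see how much the spacing choice in Note2 matters, here is a small sketch comparing two order-preserving encodings of the altered table. The second spacing (0, 1, 5) is an arbitrary value picked for illustration. The sketch also computes Spearman's rank correlation as Pearson on ranks, which depends only on the ordering; that part is my addition, not something the answer proposes.

```python
import pandas as pd

a = pd.Series([100.1396, 105.3268, 200.3766, 150.2400])
c = pd.Series(["Medium", "Medium", "High", "Medium-High"])

# Two encodings that both respect Medium < Medium-High < High
equal_spacing = c.map({"Medium": 0, "Medium-High": 1, "High": 2})
uneven_spacing = c.map({"Medium": 0, "Medium-High": 1, "High": 5})  # arbitrary

def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman's rho as the Pearson correlation of ranks (ties get average ranks)."""
    return float(x.rank().corr(y.rank()))

pearson_equal = a.corr(equal_spacing)
pearson_uneven = a.corr(uneven_spacing)

print("Pearson, equal spacing: ", pearson_equal)
print("Pearson, uneven spacing:", pearson_uneven)
print("Spearman, equal spacing: ", spearman(a, equal_spacing))
print("Spearman, uneven spacing:", spearman(a, uneven_spacing))
```

Pearson changes with the spacing while the rank-based value does not, so a rank correlation is worth considering when the distances between categories are unknown.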
This is not an answer about categorical column, because categories are just converted to corresponding metric values. But if it is possible - then the column is not really a categorical column.
– ei-grad
Nov 13 '18 at 8:17
1
@ei-grad There are two types of categorical variables: Ordinal and nominal. Ordinal means the categories can be ordered, like small/medium/high, which is what the question is asking, and why I ordered them in numeric format. Nominal means categories that don't have an inherent ordering, such as male/female/other, which my "Note1" hints. I don't really understand your objection. Categorical variables (ordinal ones) can definitely be converted to numeric values, as long as the implementer knows what he is doing.
– FatihAkici
Nov 13 '18 at 22:34
Possibility to order doesn't mean you could replace the category by arbitrary integer values; if you do so, correlation would be calculated in a wrong way.
– ei-grad
Nov 14 '18 at 18:11
@ei-grad Thanks for falsifying your claim "if it is possible - then the column is not really a categorical column" by mentioning "Possibility to order". As for incorrect calculation, first you need to understand how software packages are doing it. When you call something like corr(NumericVar, CategoricalVar), the default treatment is the conversion of CategoricalVar into integers. If one chooses that path, one must pay attention to my argument. If not, other "proper" ways are contingency tables and Cramer's V (mentioned in my Note1). Your comments are not adding any extra information.
– FatihAkici
Nov 15 '18 at 19:31
Please read carefully, there is no falsifying of my previous comment. Further discussion should be moved to the chat, but I'm not sure it is needed.
– ei-grad
Nov 17 '18 at 15:42
Basically, there is no good scientific way to do it. I would use the following approach:
1. Split the numeric field into n groups, where n = number of groups of the categorical field.
2. Calculate Cramér's V between the two categorical fields.
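The two steps above can be sketched as follows. This is an illustration under my own assumptions: the helper `cramers_v` is hypothetical, the chi-square statistic is computed by hand without bias correction, and equal-width bins via `pd.cut` are one choice among several for splitting the numeric field. With only four rows, as in the question, the result is illustrative rather than meaningful.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1
    # V is undefined when either variable has a single observed level
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else float("nan")

df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# Step 1: bin A into as many groups as CatColumn has categories
n_groups = df["CatColumn"].nunique()
a_binned = pd.cut(df["A"], bins=n_groups, labels=False)

# Step 2: Cramér's V between the binned numeric column and the category
print(cramers_v(a_binned, df["CatColumn"]))
```

Cramér's V ranges from 0 to 1 and carries no sign, so it measures strength of association but not direction.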
The right way to correlate a categorical column with N values is to split this column into N separate boolean columns.
Let's take the DataFrame from the original question. Make one boolean column per category:
for i in df.CatColumn.unique():
    df[i] = df.CatColumn == i
Then it is possible to calculate the correlation between every category and the other columns:
# Drop the original string column, which corr() cannot use
df.drop(columns='CatColumn').corr()
Output:
A B Medium High Medium-High
A 1.000000 0.490608 0.914322 -0.312309 -0.743459
B 0.490608 1.000000 0.343620 0.548589 -0.945367
Medium 0.914322 0.343620 1.000000 -0.577350 -0.577350
High -0.312309 0.548589 -0.577350 1.000000 -0.333333
Medium-High -0.743459 -0.945367 -0.577350 -0.333333 1.000000
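The same indicator columns can be built with pandas' `get_dummies`. This is a minimal sketch using the question's data; passing `dtype=float` is my choice so the indicators are numeric regardless of pandas version. Each coefficient against a 0/1 indicator is a point-biserial correlation, so it describes one category versus all the others rather than CatColumn as a whole.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# One 0/1 column per category; columns come out sorted by label
indicators = pd.get_dummies(df["CatColumn"], dtype=float)

# Correlate the numeric columns with each indicator
encoded = pd.concat([df[["A", "B"]], indicators], axis=1)
corr = encoded.corr()
print(corr.loc[["A", "B"], indicators.columns])
```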
So how would you answer the question? Can you adjust your answer to actually answer the OP?
– FatihAkici
Nov 13 '18 at 22:29
@FatihAkici I thought it is ok since it directly answers the question how to correlate categorical column in pandas, but I updated it to match the dataframe used in OP.
– ei-grad
Nov 14 '18 at 18:52
1
Please re-read the question, and also check out all of the answers given. You can not find correlation between a variable A and a category of another variable, Medium. That makes zero sense. The goal is to find correlation between A and CatColumn, A and B, and B and CatColumn. Sorry to say this but your answer carries no sensible information.
– FatihAkici
Nov 14 '18 at 18:54
1
Correlation exists between random variables. Not on a fixed value of them. Medium is a fixed value, it doesn't change, has zero variance, hence it can not have covariance or correlation with any variable. Its correlation with anything is zero. It doesn't make sense to even try to calculate its correlation with anything.
– FatihAkici
Nov 14 '18 at 19:07
1
No it is, you can see the correlation values for each category in the table :). Correlation couldn't be expressed with a single number for categorical feature with several categories, it would be meaningless because the categorical feature couldn't be represented by a metric value by its definition.
– ei-grad
Nov 21 '18 at 13:38
answered Dec 20 '17 at 3:29
FatihAkici
This is not an answer about categorical column, because categories are just converted to corresponding metric values. But if it is possible - then the column is not really a categorical column.
– ei-grad
Nov 13 '18 at 8:17
1
@ei-grad There are two types of categorical variables: Ordinal and nominal. Ordinal means the categories can be ordered, like small/medium/high, which is what the question is asking, and why I ordered them in numeric format. Nominal means categories that don't have an inherent ordering, such as male/female/other, which my "Note1" hints. I don't really understand your objection. Categorical variables (ordinal ones) can definitely be converted to numeric values, as long as the implementer knows what he is doing.
– FatihAkici
Nov 13 '18 at 22:34
Possibility to order doesn't mean you could replace the category by arbitary integer values, if you do so correllation would be calculated in a wrong way.
– ei-grad
Nov 14 '18 at 18:11
@ei-grad Thanks for falsifying your claim "if it is possible - then the column is not really a categorical column" by mentioning "Possibility to order". As for incorrect calculation, first you need to understand how software packages are doing it. When you call something likecorr(NumericVar, CategoricalVar)
, the default treatment is the conversion ofCategoricalVar
into integers. If one chooses that path, one must pay attention to my argument. If not, other "proper" ways are contingency tables and Cramer's V (mentioned in my Note1). Your comments are not adding any extra information.
– FatihAkici
Nov 15 '18 at 19:31
Please read carefully, there is no falsifying of my previous comment. Further discussion should be moved to the chat, but I'm not sure it is needed.
– ei-grad
Nov 17 '18 at 15:42
add a comment |
This is not an answer about categorical column, because categories are just converted to corresponding metric values. But if it is possible - then the column is not really a categorical column.
– ei-grad
Nov 13 '18 at 8:17
1
@ei-grad There are two types of categorical variables: Ordinal and nominal. Ordinal means the categories can be ordered, like small/medium/high, which is what the question is asking, and why I ordered them in numeric format. Nominal means categories that don't have an inherent ordering, such as male/female/other, which my "Note1" hints. I don't really understand your objection. Categorical variables (ordinal ones) can definitely be converted to numeric values, as long as the implementer knows what he is doing.
– FatihAkici
Nov 13 '18 at 22:34
Possibility to order doesn't mean you could replace the category by arbitary integer values, if you do so correllation would be calculated in a wrong way.
– ei-grad
Nov 14 '18 at 18:11
@ei-grad Thanks for falsifying your claim "if it is possible - then the column is not really a categorical column" by mentioning "Possibility to order". As for incorrect calculation, first you need to understand how software packages are doing it. When you call something likecorr(NumericVar, CategoricalVar)
, the default treatment is the conversion ofCategoricalVar
into integers. If one chooses that path, one must pay attention to my argument. If not, other "proper" ways are contingency tables and Cramer's V (mentioned in my Note1). Your comments are not adding any extra information.
– FatihAkici
Nov 15 '18 at 19:31
Please read carefully, there is no falsifying of my previous comment. Further discussion should be moved to the chat, but I'm not sure it is needed.
– ei-grad
Nov 17 '18 at 15:42
This is not an answer about categorical column, because categories are just converted to corresponding metric values. But if it is possible - then the column is not really a categorical column.
– ei-grad
Nov 13 '18 at 8:17
This is not an answer about categorical column, because categories are just converted to corresponding metric values. But if it is possible - then the column is not really a categorical column.
– ei-grad
Nov 13 '18 at 8:17
1
1
@ei-grad There are two types of categorical variables: Ordinal and nominal. Ordinal means the categories can be ordered, like small/medium/high, which is what the question is asking, and why I ordered them in numeric format. Nominal means categories that don't have an inherent ordering, such as male/female/other, which my "Note1" hints. I don't really understand your objection. Categorical variables (ordinal ones) can definitely be converted to numeric values, as long as the implementer knows what he is doing.
– FatihAkici
Nov 13 '18 at 22:34
@ei-grad There are two types of categorical variables: Ordinal and nominal. Ordinal means the categories can be ordered, like small/medium/high, which is what the question is asking, and why I ordered them in numeric format. Nominal means categories that don't have an inherent ordering, such as male/female/other, which my "Note1" hints. I don't really understand your objection. Categorical variables (ordinal ones) can definitely be converted to numeric values, as long as the implementer knows what he is doing.
– FatihAkici
Nov 13 '18 at 22:34
Possibility to order doesn't mean you could replace the category by arbitary integer values, if you do so correllation would be calculated in a wrong way.
– ei-grad
Nov 14 '18 at 18:11
Possibility to order doesn't mean you could replace the category by arbitary integer values, if you do so correllation would be calculated in a wrong way.
– ei-grad
Nov 14 '18 at 18:11
@ei-grad Thanks for falsifying your claim "if it is possible - then the column is not really a categorical column" by mentioning "Possibility to order". As for incorrect calculation, first you need to understand how software packages are doing it. When you call something like
corr(NumericVar, CategoricalVar)
, the default treatment is the conversion of CategoricalVar
into integers. If one chooses that path, one must pay attention to my argument. If not, other "proper" ways are contingency tables and Cramer's V (mentioned in my Note1). Your comments are not adding any extra information.– FatihAkici
Nov 15 '18 at 19:31
@ei-grad Thanks for falsifying your claim "if it is possible - then the column is not really a categorical column" by mentioning "Possibility to order". As for incorrect calculation, first you need to understand how software packages are doing it. When you call something like
corr(NumericVar, CategoricalVar)
, the default treatment is the conversion of CategoricalVar
into integers. If one chooses that path, one must pay attention to my argument. If not, other "proper" ways are contingency tables and Cramer's V (mentioned in my Note1). Your comments are not adding any extra information.– FatihAkici
Nov 15 '18 at 19:31
Please read carefully, there is no falsifying of my previous comment. Further discussion should be moved to the chat, but I'm not sure it is needed.
– ei-grad
Nov 17 '18 at 15:42
Basically, there is no good scientific way to do it. I would use the following approach:
1. Split the numeric field into n groups, where n = the number of groups in the categorical field.
2. Calculate Cramér's V between the two categorical fields.
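The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the numeric column is binned with equal-frequency bins via pd.qcut (the answer does not say how to split), the helper name cramers_v is hypothetical, and Cramér's V is computed from the plain chi-square statistic without any bias correction. The four-row frame from the question is far too small for the result to mean anything; it is used here only to show the mechanics.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V between two categorical series (no bias correction)."""
    # Contingency table of observed counts.
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else float("nan")


df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# Step 1: split the numeric field into as many groups as there are categories.
# Equal-frequency binning is an assumption made for this sketch.
n_groups = df["CatColumn"].nunique()
a_binned = pd.qcut(df["A"], q=n_groups, labels=False)

# Step 2: Cramér's V between the binned numeric field and the categorical field.
v = cramers_v(a_binned, df["CatColumn"])
print(v)
```

Cramér's V ranges from 0 (no association) to 1 (perfect association) and, unlike Pearson correlation, carries no sign.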
answered Jan 24 at 15:28
cy-press
The right way to correlate a categorical column with N values is to split this column into N separate boolean columns.
Let's take the DataFrame from the original question. Make the category columns:
# one boolean indicator column per distinct category
for i in df.CatColumn.unique():
    df[i] = df.CatColumn == i
Then it is possible to calculate the correlation between every category and the other columns:
# drop the original string column, which corr() cannot convert to numbers
df.drop(columns='CatColumn').corr()
Output:
                    A         B    Medium      High  Medium-High
A            1.000000  0.490608  0.914322 -0.312309    -0.743459
B            0.490608  1.000000  0.343620  0.548589    -0.945367
Medium       0.914322  0.343620  1.000000 -0.577350    -0.577350
High        -0.312309  0.548589 -0.577350  1.000000    -0.333333
Medium-High -0.743459 -0.945367 -0.577350 -0.333333     1.000000
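As a side note, the same indicator columns can be built with pandas' pd.get_dummies instead of a manual loop. The sketch below is one possible variant, not the answer's own code: it leaves df unmodified, and get_dummies orders the indicator columns alphabetically, so the layout differs from the table above while the values should agree.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# One 0/1 indicator column per category; df itself is left untouched.
dummies = pd.get_dummies(df["CatColumn"], dtype=float)

# Correlate the numeric columns with each indicator column.
corr = pd.concat([df[["A", "B"]], dummies], axis=1).corr()
print(corr.round(6))
```

Each entry in an indicator row is a Pearson correlation with a 0/1 variable (a point-biserial correlation), so it describes one category against the rest rather than CatColumn as a whole.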
So how would you answer the question? Can you adjust your answer to actually answer the OP?
– FatihAkici
Nov 13 '18 at 22:29
@FatihAkici I thought it was OK since it directly answers the question of how to correlate a categorical column in pandas, but I updated it to match the DataFrame used in the OP.
– ei-grad
Nov 14 '18 at 18:52
Please re-read the question, and also check out all of the answers given. You can not find correlation between a variable A and a category of another variable Medium. That makes zero sense. The goal is to find correlation between A and CatColumn, A and B, and B and CatColumn. Sorry to say this but your answer carries no sensible information.
– FatihAkici
Nov 14 '18 at 18:54
Correlation exists between random variables. Not on a fixed value of them. Medium is a fixed value, it doesn't change, has zero variance, hence it can not have covariance or correlation with any variable. Its correlation with anything is zero. It doesn't make sense to even try to calculate its correlation with anything.
– FatihAkici
Nov 14 '18 at 19:07
No, it is: you can see the correlation values for each category in the table :). Correlation can't be expressed as a single number for a categorical feature with several categories; that would be meaningless, because by definition a categorical feature can't be represented by a metric value.
– ei-grad
Nov 21 '18 at 13:38
edited Nov 14 '18 at 18:50
answered Nov 13 '18 at 10:23
ei-grad