How to correlate categorical column in pandas?

I have a DataFrame df with a non-numerical column CatColumn.

 A B CatColumn
0 381.1396 7.343921 Medium
1 481.3268 6.786945 Medium
2 263.3766 7.628746 High
3 177.2400 5.225647 Medium-High

I want to include CatColumn in the correlation analysis with other columns in the Dataframe. I tried DataFrame.corr but it does not include columns with nominal values in the correlation analysis.

asked Dec 19 '17 at 20:02

yousraHazem

6317

python pandas scikit-learn correlation categorical-data

3 Answers

I am going to strongly disagree with the other comments.

They miss the main point of correlation: How much does variable 1 increase or decrease as variable 2 increases or decreases. So in the very first place, order of the ordinal variable must be preserved during factorization/encoding. If you alter the order of variables, correlation will change completely. If you are building a tree-based method, this is a non-issue but for a correlation analysis, special attention must be paid to preservation of order in an ordinal variable.

Let me make my argument reproducible. A and B are numeric, C is ordinal categorical in the following table, which is intentionally slightly altered from the one in the question.

from io import StringIO
import pandas as pd

rawText = StringIO("""
 A B C
0 100.1396 1.343921 Medium
1 105.3268 1.786945 Medium
2 200.3766 9.628746 High
3 150.2400 4.225647 Medium-High
""")
# Columns are separated by runs of whitespace.
myData = pd.read_csv(rawText, sep=r"\s+")

Notice: As C moves from Medium to Medium-High to High, both A and B increase monotonically. Hence we should see strong correlations between tuples (C,A) and (C,B). Let's reproduce the two proposed answers:

In[226]: myData.assign(C=myData.C.astype('category').cat.codes).corr()
Out[226]: 
          A         B         C
A  1.000000  0.986493 -0.438466
B  0.986493  1.000000 -0.579650
C -0.438466 -0.579650  1.000000

Wait... What? Negative correlations? How come? Something is definitely not right. So what is going on?

What is going on is that C is factorized according to the alphanumerical sorting of its values. [High, Medium, Medium-High] are assigned [0, 1, 2], therefore the ordering is altered: 0 < 1 < 2 implies High < Medium < Medium-High, which is not true. Hence we accidentally calculated the response of A and B as C goes from High to Medium to Medium-High. The correct answer must preserve ordering, and assign [2, 0, 1] to [High, Medium, Medium-High]. Here is how:

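In[227]: myData['C'] = myData['C'].astype('category')
# rename_categories replaces direct assignment to .cat.categories,
# which recent pandas versions no longer allow.
myData['C'] = myData['C'].cat.rename_categories({'High': 2, 'Medium': 0, 'Medium-High': 1})
myData['C'] = myData['C'].astype('float')
myData.corr()
Out[227]: 
          A         B         C
A  1.000000  0.986493  0.998874
B  0.986493  1.000000  0.982982
C  0.998874  0.982982  1.000000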

Much better!
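The same order-preserving encoding can also be sketched with an ordered Categorical, so that the codes follow an explicit list of levels instead of alphabetical order. This is only an illustration: the names `my_data`, `levels` and `encoded` are made up here, and the level order is assumed to be Medium < Medium-High < High as argued above. Spearman's rank correlation is added as a comparison because it only uses the order of the codes, not their spacing.

```python
import pandas as pd

# Same toy data as in the answer above.
my_data = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400],
    "B": [1.343921, 1.786945, 9.628746, 4.225647],
    "C": ["Medium", "Medium", "High", "Medium-High"],
})

# Assumed ordering of the levels, lowest to highest.
levels = ["Medium", "Medium-High", "High"]

# An ordered Categorical assigns codes in the order given: 0, 1, 2.
ordered_c = pd.Categorical(my_data["C"], categories=levels, ordered=True)
encoded = my_data.assign(C=ordered_c.codes)

pearson = encoded.corr()                    # treats the codes as equally spaced
spearman = encoded.corr(method="spearman")  # uses ranks only
print(pearson)
print(spearman)
```

With this encoding the Pearson matrix should match Out[227] above, since the codes are the same numbers.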

Note1: If you want to treat your variable as a nominal variable, you can look at things like contingency tables, Cramer's V and the like; or group the continuous variable by the nominal categories etc. I don't think it would be right, though.
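As a rough sketch of the "group the continuous variable by the nominal categories" idea, one could look at per-category means and at the share of variance that the grouping explains. The second quantity is usually called the correlation ratio (eta squared); it is not named in the answer and is shown here only as one possible summary. The helper name `correlation_ratio` is hypothetical, and the data is the table from the question.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# Per-category means of the numeric columns.
group_means = df.groupby("CatColumn")[["A", "B"]].mean()
print(group_means)


def correlation_ratio(values, groups):
    """Share of the variance of `values` explained by group membership (eta squared)."""
    grand_mean = values.mean()
    # Replace every value by the mean of its own group.
    group_mean_per_row = values.groupby(groups).transform("mean")
    between = ((group_mean_per_row - grand_mean) ** 2).sum()
    total = ((values - grand_mean) ** 2).sum()
    return between / total


eta_sq_a = correlation_ratio(df["A"], df["CatColumn"])
print(eta_sq_a)
```

With only four rows and two single-row groups the value is bound to be high, so it says little about a real dataset.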

Note2: If you had another category called Low, my answer could be criticized because I assigned equally spaced numbers to unequally spaced categories. You could argue that one should assign [2, 1, 1.5, 0] to [High, Medium, Medium-High, Low], which would be valid. I believe this is what people call the art part of data science.
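To see how much the choice of spacing can matter, here is a small sketch that scores the same ordered levels in two ways and compares the Pearson correlation with A. Both score dictionaries are hypothetical and chosen only for illustration. Note that shifting or rescaling all scores by the same amounts leaves Pearson's r unchanged, so only the relative gaps between levels make a difference.

```python
import pandas as pd

# Toy data from the answer above (columns A and C only).
my_data = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400],
    "C": ["Medium", "Medium", "High", "Medium-High"],
})

# Two hypothetical scorings of the same ordered levels.
equal_spacing = {"Medium": 0, "Medium-High": 1, "High": 2}
unequal_spacing = {"Medium": 0, "Medium-High": 1, "High": 3}

r_equal = my_data["A"].corr(my_data["C"].map(equal_spacing))
r_unequal = my_data["A"].corr(my_data["C"].map(unequal_spacing))
print(r_equal, r_unequal)
```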

answered Dec 20 '17 at 3:29

FatihAkici

1,7551029

This is not an answer about categorical column, because categories are just converted to corresponding metric values. But if it is possible - then the column is not really a categorical column.

– ei-grad
Nov 13 '18 at 8:17

1

@ei-grad There are two types of categorical variables: Ordinal and nominal. Ordinal means the categories can be ordered, like small/medium/high, which is what the question is asking, and why I ordered them in numeric format. Nominal means categories that don't have an inherent ordering, such as male/female/other, which my "Note1" hints. I don't really understand your objection. Categorical variables (ordinal ones) can definitely be converted to numeric values, as long as the implementer knows what he is doing.

– FatihAkici
Nov 13 '18 at 22:34

The possibility of ordering doesn't mean you can replace the categories with arbitrary integer values; if you do so, the correlation would be calculated in a wrong way.

– ei-grad
Nov 14 '18 at 18:11

@ei-grad Thanks for falsifying your claim "if it is possible - then the column is not really a categorical column" by mentioning "Possibility to order". As for incorrect calculation, first you need to understand how software packages are doing it. When you call something like corr(NumericVar, CategoricalVar), the default treatment is the conversion of CategoricalVar into integers. If one chooses that path, one must pay attention to my argument. If not, other "proper" ways are contingency tables and Cramer's V (mentioned in my Note1). Your comments are not adding any extra information.

– FatihAkici
Nov 15 '18 at 19:31

Please read carefully, there is no falsifying of my previous comment. Further discussion should be moved to the chat, but I'm not sure it is needed.

– ei-grad
Nov 17 '18 at 15:42


Basically, there is no good scientific way to do it. I would use the following approach:
1. Split the numeric field into n groups, where n is the number of groups in the categorical field.
2. Calculate Cramér's V between the two categorical fields.
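The two steps above could be sketched as follows. This is only an illustration under assumptions: quantile bins via pd.qcut are one possible way to split the numeric field, the helper `cramers_v` is written here by hand from the chi-squared statistic of the contingency table (without bias correction), and the data is the four-row table from the question, which is far too small for the result to mean much.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V of two categorical series (no bias correction).

    Assumes both series have at least two distinct values.
    """
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Data from the question.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# Step 1: split the numeric field into as many groups as there are categories.
n_groups = df["CatColumn"].nunique()
a_binned = pd.qcut(df["A"], q=n_groups, labels=False)

# Step 2: Cramér's V between the binned numeric field and the categorical field.
v = cramers_v(a_binned, df["CatColumn"])
print(v)
```

Cramér's V lies between 0 (no association) and 1 (perfect association) and carries no sign, so it does not tell you the direction of the relationship.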

answered Jan 24 at 15:28

cy-press


-1

The right way to correlate a categorical column with N values is to split this column into N separate boolean columns.

Let's take the original question dataframe. Make the category columns:

for i in df.CatColumn.unique():
 df[i] = df.CatColumn == i

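Then it is possible to calculate the correlation between every category and the other columns (on pandas 2.0 and later, numeric_only=True is needed so that the string column CatColumn is skipped; older versions skipped it by default):

df.corr(numeric_only=True)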

Output:

                    A         B    Medium      High  Medium-High
A            1.000000  0.490608  0.914322 -0.312309    -0.743459
B            0.490608  1.000000  0.343620  0.548589    -0.945367
Medium       0.914322  0.343620  1.000000 -0.577350    -0.577350
High        -0.312309  0.548589 -0.577350  1.000000    -0.333333
Medium-High -0.743459 -0.945367 -0.577350 -0.333333     1.000000
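The same indicator columns can be built in one call with pd.get_dummies, which avoids the explicit loop and leaves the original dataframe untouched. This is a sketch of an equivalent route, not the code the answer used; the names `dummies` and `indicator_corr` are made up for the example.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# One 0/1 indicator column per category.
dummies = pd.get_dummies(df["CatColumn"], dtype=float)

# Correlate the numeric columns with the indicators.
indicator_corr = pd.concat([df[["A", "B"]], dummies], axis=1).corr()
print(indicator_corr.loc[["A", "B"], dummies.columns])
```

Each number in that slice is the correlation between a numeric column and membership in a single category, not a single correlation with CatColumn as a whole.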

edited Nov 14 '18 at 18:50

answered Nov 13 '18 at 10:23

ei-grad

632613

So how would you answer the question? Can you adjust your answer to actually answer the OP?

– FatihAkici
Nov 13 '18 at 22:29

@FatihAkici I thought it is ok since it directly answers the question how to correlate categorical column in pandas, but I updated it to match the dataframe used in OP.

– ei-grad
Nov 14 '18 at 18:52

1

Please re-read the question, and also check out all of the answers given. You can not find correlation between a variable A and a category of another variable Medium. That makes zero sense. The goal is to find correlation between A and CatColumn, A and B, and B and CatColumn. Sorry to say this but your answer carries no sensible information.

– FatihAkici
Nov 14 '18 at 18:54

1

Correlation exists between random variables. Not on a fixed value of them. Medium is a fixed value, it doesn't change, has zero variance, hence it can not have covariance or correlation with any variable. Its correlation with anything is zero. It doesn't make sense to even try to calculate its correlation with anything.

– FatihAkici
Nov 14 '18 at 19:07

1

No it is, you can see the correlation values for each category in the table :). Correlation couldn't be expressed with a single number for categorical feature with several categories, it would be meaningless because the categorical feature couldn't be represented by a metric value by its definition.

– ei-grad
Nov 21 '18 at 13:38
