How to correlate categorical column in pandas?










8















I have a DataFrame df with a non-numerical column CatColumn.



          A         B    CatColumn
0  381.1396  7.343921       Medium
1  481.3268  6.786945       Medium
2  263.3766  7.628746         High
3  177.2400  5.225647  Medium-High


I want to include CatColumn in the correlation analysis with the other columns in the DataFrame. I tried DataFrame.corr, but it excludes columns with nominal values from the correlation analysis.










      python pandas scikit-learn correlation categorical-data
















      asked Dec 19 '17 at 20:02









      yousraHazem























          3 Answers


















          14














          I am going to strongly disagree with the other comments.



          They miss the main point of correlation: how much does variable 1 increase or decrease as variable 2 increases or decreases? So, first of all, the order of the ordinal variable must be preserved during factorization/encoding. If you alter the order of its categories, the correlation will change completely. If you are building a tree-based model, this is a non-issue, but for a correlation analysis, special attention must be paid to preserving the order of an ordinal variable.



          Let me make my argument reproducible. A and B are numeric, C is ordinal categorical in the following table, which is intentionally slightly altered from the one in the question.



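          from io import StringIO

          import pandas as pd

          # Whitespace-separated sample; the unnamed first column becomes the index.
          rawText = StringIO("""
          A B C
          0 100.1396 1.343921 Medium
          1 105.3268 1.786945 Medium
          2 200.3766 9.628746 High
          3 150.2400 4.225647 Medium-High
          """)
          myData = pd.read_csv(rawText, sep=r"\s+")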


          Notice: As C moves from Medium to Medium-High to High, both A and B increase monotonically. Hence we should see strong correlations between tuples (C,A) and (C,B). Let's reproduce the two proposed answers:



          In[226]: myData.assign(C=myData.C.astype('category').cat.codes).corr()
          Out[226]:
                    A         B         C
          A  1.000000  0.986493 -0.438466
          B  0.986493  1.000000 -0.579650
          C -0.438466 -0.579650  1.000000


          Wait... What? Negative correlations? How come? Something is definitely not right. So what is going on?



          What is going on is that C is factorized according to the alphanumerical sorting of its values. [High, Medium, Medium-High] are assigned [0, 1, 2], therefore the ordering is altered: 0 < 1 < 2 implies High < Medium < Medium-High, which is not true. Hence we accidentally calculated the response of A and B as C goes from High to Medium to Medium-High. The correct answer must preserve ordering, and assign [2, 0, 1] to [High, Medium, Medium-High]. Here is how:



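          In[227]: myData['C'] = myData['C'].astype('category')
                   # Categories are sorted as [High, Medium, Medium-High]; rename them to [2, 0, 1].
                   # (Assigning to .cat.categories directly no longer works in recent pandas.)
                   myData['C'] = myData['C'].cat.rename_categories([2, 0, 1])
                   myData['C'] = myData['C'].astype('float')
                   myData.corr()
          Out[227]:
                    A         B         C
          A  1.000000  0.986493  0.998874
          B  0.986493  1.000000  0.982982
          C  0.998874  0.982982  1.000000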


          Much better!
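A variation on the same idea, as a minimal sketch: instead of renaming the alphabetically sorted categories, the intended order can be declared up front with an ordered `pd.Categorical`, so the integer codes follow that order. The variable names (`order`, `naive`, `encoded`) are illustrative only, and the table is built inline here rather than parsed from text.

```python
import pandas as pd

# Same values as the altered table above.
my_data = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400],
    "B": [1.343921, 1.786945, 9.628746, 4.225647],
    "C": ["Medium", "Medium", "High", "Medium-High"],
})

# Default factorization sorts the labels alphabetically: High=0, Medium=1, Medium-High=2.
naive = my_data.assign(C=my_data["C"].astype("category").cat.codes)
naive_corr = naive.corr()

# Declaring the order explicitly gives Medium=0, Medium-High=1, High=2.
order = ["Medium", "Medium-High", "High"]
ordered_c = pd.Categorical(my_data["C"], categories=order, ordered=True)
encoded = my_data.assign(C=ordered_c.codes)
encoded_corr = encoded.corr()

print(naive_corr)
print(encoded_corr)
```

The only design choice is where the order lives: an explicit category list keeps the encoding readable and does not depend on how pandas happens to sort the labels.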



          Note1: If you want to treat your variable as a nominal variable, you can look at things like contingency tables, Cramér's V and the like, or group the continuous variable by the nominal categories, etc. I don't think it would be right, though.
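For the "group the continuous variable by the nominal categories" route, a minimal sketch could look like the following. It uses column A and CatColumn from the question. The correlation ratio (eta squared) computed at the end is one possible summary of how much of A's variance lies between the groups; it is an assumption of this sketch and is not named in the answer.

```python
import pandas as pd

# Column A and CatColumn from the question's DataFrame.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# Per-category size and mean of the continuous variable.
group_stats = df.groupby("CatColumn")["A"].agg(["count", "mean"])

# Eta squared: between-group sum of squares divided by total sum of squares.
grand_mean = df["A"].mean()
ss_between = (group_stats["count"] * (group_stats["mean"] - grand_mean) ** 2).sum()
ss_total = ((df["A"] - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(group_stats)
print(eta_squared)
```

With only four rows the number itself means little; the point is that this treatment never assigns an order or spacing to the categories.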



          Note2: If you had another category called Low, my answer could be criticized because I assigned equally spaced numbers to unequally spaced categories. You could argue that one should assign [2, 1, 1.5, 0] to [High, Medium, Medium-High, Low], which would be valid. I believe this is what people call the art part of data science.
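To illustrate the spacing question, here is a rough sketch that compares the unequal spacing suggested above with plain equal spacing. The extra `Low` row (A = 60.0) is hypothetical and added only so that all four categories appear. The rank-based (Spearman-style) correlation is computed as Pearson on ranks; it is shown as a comparison, not as something the answer proposes.

```python
import pandas as pd

# Altered table from the answer plus one hypothetical "Low" row.
data = pd.DataFrame({
    "A": [100.1396, 105.3268, 200.3766, 150.2400, 60.0],
    "C": ["Medium", "Medium", "High", "Medium-High", "Low"],
})

# Two order-preserving encodings that differ only in spacing.
spaced = {"Low": 0.0, "Medium": 1.0, "Medium-High": 1.5, "High": 2.0}
even = {"Low": 0.0, "Medium": 1.0, "Medium-High": 2.0, "High": 3.0}
c_spaced = data["C"].map(spaced)
c_even = data["C"].map(even)

# Pearson correlation depends on the chosen spacing.
pearson_spaced = data["A"].corr(c_spaced)
pearson_even = data["A"].corr(c_even)

# Pearson on ranks (Spearman) depends only on the order, not the spacing.
spearman_spaced = data["A"].rank().corr(c_spaced.rank())
spearman_even = data["A"].rank().corr(c_even.rank())

print(pearson_spaced, pearson_even)
print(spearman_spaced, spearman_even)
```

If the spacing between categories cannot be justified, a rank-based measure avoids having to pick one, at the cost of ignoring magnitudes.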





























          • This is not an answer about a categorical column, because the categories are just converted to corresponding metric values. But if that is possible, then the column is not really a categorical column.

            – ei-grad
            Nov 13 '18 at 8:17






          • 1





            @ei-grad There are two types of categorical variables: Ordinal and nominal. Ordinal means the categories can be ordered, like small/medium/high, which is what the question is asking, and why I ordered them in numeric format. Nominal means categories that don't have an inherent ordering, such as male/female/other, which my "Note1" hints. I don't really understand your objection. Categorical variables (ordinal ones) can definitely be converted to numeric values, as long as the implementer knows what he is doing.

            – FatihAkici
            Nov 13 '18 at 22:34











          • The possibility of ordering doesn't mean you can replace the categories with arbitrary integer values; if you do so, the correlation will be calculated incorrectly.

            – ei-grad
            Nov 14 '18 at 18:11











          • @ei-grad Thanks for falsifying your claim "if it is possible - then the column is not really a categorical column" by mentioning "Possibility to order". As for incorrect calculation, first you need to understand how software packages are doing it. When you call something like corr(NumericVar, CategoricalVar), the default treatment is the conversion of CategoricalVar into integers. If one chooses that path, one must pay attention to my argument. If not, other "proper" ways are contingency tables and Cramer's V (mentioned in my Note1). Your comments are not adding any extra information.

            – FatihAkici
            Nov 15 '18 at 19:31











          • Please read carefully, there is no falsifying of my previous comment. Further discussion should be moved to the chat, but I'm not sure it is needed.

            – ei-grad
            Nov 17 '18 at 15:42


















          0














          Basically, there is no good scientific way to do it. I would use the following approach:
          1. Split the numeric field into n groups, where n is the number of categories in the categorical field.
          2. Calculate Cramér's V between the two categorical fields.
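The two steps above can be sketched roughly as follows. The data is made up for illustration (twelve hypothetical rows), `cramers_v` is a helper written for this sketch rather than a pandas function, and it uses the plain chi-squared formula without any bias correction. Quantile binning via `pd.qcut` is one assumed way to "split into n groups"; equal-width bins via `pd.cut` would be another.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V from the chi-squared statistic of the contingency table."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


# Hypothetical data: A rises with the category.
df = pd.DataFrame({
    "A": [100.0, 105.0, 110.0, 120.0, 150.0, 155.0,
          160.0, 165.0, 200.0, 210.0, 220.0, 230.0],
    "CatColumn": ["Medium"] * 4 + ["Medium-High"] * 4 + ["High"] * 4,
})

# Step 1: split the numeric field into as many groups as there are categories.
n_groups = df["CatColumn"].nunique()
a_binned = pd.qcut(df["A"], q=n_groups, labels=False)

# Step 2: Cramér's V between the binned field and the categorical field.
v = cramers_v(a_binned, df["CatColumn"])
print(v)
```

Note that Cramér's V ranges from 0 to 1 and carries no sign, so it measures strength of association only, and the result depends on how the numeric field is binned.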




































            -1














            The right way to correlate a categorical column with N values is to split this column into N separate boolean columns.



            Let's take the DataFrame from the original question and make the category columns:



            for category in df.CatColumn.unique():
                # One 0/1 indicator column per category value.
                df[category] = (df.CatColumn == category).astype(int)


            Then it is possible to calculate the correlation between every category and other columns:



            # Leave out the original string column, which corr() cannot use.
            df.drop(columns='CatColumn').corr()


            Output:



                                A         B    Medium      High  Medium-High
            A            1.000000  0.490608  0.914322 -0.312309    -0.743459
            B            0.490608  1.000000  0.343620  0.548589    -0.945367
            Medium       0.914322  0.343620  1.000000 -0.577350    -0.577350
            High        -0.312309  0.548589 -0.577350  1.000000    -0.333333
            Medium-High -0.743459 -0.945367 -0.577350 -0.333333     1.000000
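The same indicator columns can also be produced with `pd.get_dummies`, which leaves the original DataFrame untouched. This is a sketch of an equivalent route, not the code the answer used; `dummies` and `indicator_corr` are illustrative names.

```python
import pandas as pd

# DataFrame from the question.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# One 0/1 indicator column per category (columns come out sorted by label).
dummies = pd.get_dummies(df["CatColumn"], dtype=float)

# Correlate the numeric columns with each indicator.
indicator_corr = pd.concat([df[["A", "B"]], dummies], axis=1).corr()
print(indicator_corr)
```

Each entry in an indicator row or column is the correlation with "is this row in that category", so the result is one number per category rather than a single coefficient for CatColumn as a whole.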






























            • So how would you answer the question? Can you adjust your answer to actually answer the OP?

              – FatihAkici
              Nov 13 '18 at 22:29











            • @FatihAkici I thought it is ok since it directly answers the question how to correlate categorical column in pandas, but I updated it to match the dataframe used in OP.

              – ei-grad
              Nov 14 '18 at 18:52







            • 1





              Please re-read the question, and also check out all of the answers given. You can not find correlation between a variable A and a category of another variable Medium. That makes zero sense. The goal is to find correlation between A and CatColumn, A and B, and B and CatColumn. Sorry to say this but your answer carries no sensible information.

              – FatihAkici
              Nov 14 '18 at 18:54






            • 1





              Correlation exists between random variables. Not on a fixed value of them. Medium is a fixed value, it doesn't change, has zero variance, hence it can not have covariance or correlation with any variable. Its correlation with anything is zero. It doesn't make sense to even try to calculate its correlation with anything.

              – FatihAkici
              Nov 14 '18 at 19:07






            • 1





              No it is, you can see the correlation values for each category in the table :). Correlation couldn't be expressed with a single number for categorical feature with several categories, it would be meaningless because the categorical feature couldn't be represented by a metric value by its definition.

              – ei-grad
              Nov 21 '18 at 13:38













































            answered Dec 20 '17 at 3:29









            FatihAkici













            • This is not an answer about a categorical column, because the categories are simply converted to corresponding metric values. If that conversion is possible, then the column is not really a categorical column.

              – ei-grad
              Nov 13 '18 at 8:17

            • @ei-grad There are two types of categorical variables: ordinal and nominal. Ordinal means the categories can be ordered, like small/medium/high, which is what the question is asking about and why I mapped them to ordered numbers. Nominal means the categories have no inherent ordering, such as male/female/other, which my "Note1" hints at. I don't really understand your objection. Ordinal categorical variables can definitely be converted to numeric values, as long as the implementer knows what they are doing.

              – FatihAkici
              Nov 13 '18 at 22:34

            • The possibility of ordering the categories doesn't mean you can replace them with arbitrary integer values; if you do so, the correlation will be calculated incorrectly.

              – ei-grad
              Nov 14 '18 at 18:11

            • @ei-grad Thanks for falsifying your claim "if it is possible - then the column is not really a categorical column" by mentioning "Possibility to order". As for the incorrect calculation, first you need to understand how software packages do it. When you call something like corr(NumericVar, CategoricalVar), the default treatment is to convert CategoricalVar into integers. If one chooses that path, one must pay attention to my argument. If not, other "proper" ways are contingency tables and Cramér's V (mentioned in my Note1). Your comments are not adding any extra information.

              – FatihAkici
              Nov 15 '18 at 19:31

            • Please read carefully; nothing in my previous comment was falsified. Further discussion should be moved to chat, but I'm not sure it is needed.

              – ei-grad
              Nov 17 '18 at 15:42
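
            The ordinal route discussed in these comments can be sketched as follows. This is only an illustration, not the answer's own code: the ordering Medium < Medium-High < High and the column name CatCode are assumptions. Spearman's rank correlation is used because it depends only on the order of the codes, not on the spacing between them, which is the concern raised above about arbitrary integers.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# Assumed ordering of the levels; adjust it to what the labels mean in your data.
levels = ["Medium", "Medium-High", "High"]
ordered = pd.Categorical(df["CatColumn"], categories=levels, ordered=True)

# .codes yields 0, 1, 2 following the declared order (hypothetical column name).
df["CatCode"] = ordered.codes

# Rank-based correlation: only the order of the codes matters.
rank_corr = df[["A", "B", "CatCode"]].corr(method="spearman")
print(rank_corr)
```

            Pearson correlation on the same codes would additionally assume that the levels are equally spaced.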






























            0














            Basically, there is no good scientific way to do it. I would use the following approach:
            1. Split the numeric field into n groups, where n is the number of categories in the categorical field.
            2. Calculate Cramér's V between the two categorical fields.
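
            The two steps above might look like this in pandas. This is a rough sketch under stated assumptions: cramers_v is a hypothetical helper (the uncorrected statistic, computed by hand from the contingency table), the column name A_binned is made up, and quantile bins from pd.qcut are just one way to split the numeric field (pd.cut would give equal-width bins instead).

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Uncorrected Cramér's V for two categorical series (hypothetical helper)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return float("nan")  # undefined when either variable has a single level
    return float(np.sqrt(chi2 / (n * min_dim)))


# Data from the question.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# Step 1: split the numeric field into as many groups as there are categories.
n_groups = df["CatColumn"].nunique()
df["A_binned"] = pd.qcut(df["A"], q=n_groups, labels=False)

# Step 2: Cramér's V between the binned numeric field and the categorical field.
v = cramers_v(df["A_binned"], df["CatColumn"])
print(v)
```

            With only four rows the value says very little; on real data the result also depends on how the bins are chosen.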
















                answered Jan 24 at 15:28









                cy-press





















                    -1














                    The right way to correlate a categorical column with N values is to split this column into N separate boolean columns.

                    Let's take the DataFrame from the original question and make one column per category:

                    for category in df.CatColumn.unique():
                        df[category] = df.CatColumn == category

                    Then it is possible to calculate the correlation between every category and the other columns (the original string column is dropped because it is not numeric):

                    df.drop(columns='CatColumn').corr()

                    Output:

                                        A         B    Medium      High  Medium-High
                    A            1.000000  0.490608  0.914322 -0.312309    -0.743459
                    B            0.490608  1.000000  0.343620  0.548589    -0.945367
                    Medium       0.914322  0.343620  1.000000 -0.577350    -0.577350
                    High        -0.312309  0.548589 -0.577350  1.000000    -0.333333
                    Medium-High -0.743459 -0.945367 -0.577350 -0.333333     1.000000
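
                    As a possible alternative to the explicit loop, the same indicator columns can be built with pd.get_dummies. This is a sketch rather than part of the original answer; the variable names dummies, encoded and corr are made up, and get_dummies sorts the new columns alphabetically, so their order differs from the table above.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({
    "A": [381.1396, 481.3268, 263.3766, 177.2400],
    "B": [7.343921, 6.786945, 7.628746, 5.225647],
    "CatColumn": ["Medium", "Medium", "High", "Medium-High"],
})

# One indicator column per category; dtype=float keeps them numeric for corr().
dummies = pd.get_dummies(df["CatColumn"], dtype=float)

# Combine the indicators with the numeric columns, leaving the string column out.
encoded = pd.concat([df[["A", "B"]], dummies], axis=1)
corr = encoded.corr()
print(corr.round(6))
```

                    Each entry in an indicator row or column is the Pearson correlation between a numeric column and a 0/1 membership flag for one category.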













                    edited Nov 14 '18 at 18:50

                    answered Nov 13 '18 at 10:23

                    ei-grad












                    • So how would you answer the question? Can you adjust your answer so that it actually answers the OP?

                      – FatihAkici
                      Nov 13 '18 at 22:29

                    • @FatihAkici I thought it was fine, since it directly answers the question of how to correlate a categorical column in pandas, but I have updated it to match the DataFrame used in the OP.

                      – ei-grad
                      Nov 14 '18 at 18:52

                    • Please re-read the question, and also check out all of the answers given. You cannot find a correlation between a variable A and a single category, Medium, of another variable. That makes zero sense. The goal is to find the correlation between A and CatColumn, A and B, and B and CatColumn. Sorry to say this, but your answer carries no sensible information.

                      – FatihAkici
                      Nov 14 '18 at 18:54

                    • Correlation exists between random variables, not on a fixed value of them. Medium is a fixed value: it doesn't change and has zero variance, hence it cannot have covariance or correlation with any variable. Its correlation with anything is zero. It doesn't make sense to even try to calculate its correlation with anything.

                      – FatihAkici
                      Nov 14 '18 at 19:07

                    • No, it does make sense: you can see the correlation values for each category in the table :). The correlation cannot be expressed as a single number for a categorical feature with several categories. That would be meaningless, because by definition a categorical feature cannot be represented by a metric value.

                      – ei-grad
                      Nov 21 '18 at 13:38

































