sklearn categorical data clustering

I'm using sklearn's agglomerative clustering. My data is mixed: it contains both numeric and nominal columns. One nominal column takes the values "Morning", "Afternoon", "Evening", and "Night". If I convert it to numeric by assigning the integer codes 0, 1, 2, 3, the Euclidean distance between "Night" and "Morning" is computed as 3, but it should be 1, because the categories are cyclic.

Here is my code:

import pandas as pd
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Load and standardize the data
X = pd.read_csv("mydata.csv", sep=",", header=0, encoding="utf-8")
X = StandardScaler().fit_transform(X)
print("n_samples: %d, n_features: %d" % X.shape)

k = 5
km = AgglomerativeClustering(n_clusters=k, affinity='euclidean', linkage='average')
km.fit(X)

print("k = %d, Silhouette Coefficient: %0.3f" % (k,
    metrics.silhouette_score(X, km.labels_, sample_size=None)))

How can I customize the distance function in sklearn, or convert my nominal data to numeric values so that the distances come out right?










  • Can you use the built-in sklearn labelencoder?

    – G. Anderson
    Nov 13 '18 at 21:04






  • You actually want to use OneHotEncoder.

    – Andreas Mueller
    Nov 14 '18 at 1:15
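
As a quick illustration of the comment above, a minimal OneHotEncoder sketch (the sample values are made up to match the question):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One nominal column, shaped (n_samples, 1) as the encoder expects
day_part = np.array([["Morning"], ["Afternoon"], ["Evening"], ["Night"]])

# fit_transform returns a sparse matrix by default; densify for inspection
onehot = OneHotEncoder().fit_transform(day_part).toarray()
print(onehot)  # each row has exactly one 1
```

With this encoding every pair of distinct categories sits at the same Euclidean distance, which removes the artificial ordering of integer codes but also discards the cyclic structure the question cares about.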
















Tags: python, scikit-learn, cluster-analysis






asked Nov 13 '18 at 20:52 by eaytan

2 Answers

I think you have 3 options for converting the categorical feature to a numerical one:



  1. Use OneHotEncoder. You transform the categorical feature into four new columns, of which exactly one is 1 and the others are 0. The problem here is that the difference between "morning" and "afternoon" is the same as the difference between "morning" and "evening".

  2. Use OrdinalEncoder. You transform the categorical feature into just one column: "morning" to 1, "afternoon" to 2, etc. The difference between "morning" and "afternoon" will be smaller than between "morning" and "evening", which is good, but the difference between "morning" and "night" will be the greatest, which might not be what you want.

  3. Use a transformation that I call two_hot_encoder. It is similar to OneHotEncoder, but there are exactly two 1s in each row. The difference between "morning" and "afternoon" will be the same as the difference between "morning" and "night", and both will be smaller than the difference between "morning" and "evening". I think this is the best solution. Check the code.

Code:



import numpy as np

# Encode each category by turning on the two "slots" it touches on the cycle
def two_hot(x):
    return np.concatenate([
        (x == "morning") | (x == "afternoon"),
        (x == "afternoon") | (x == "evening"),
        (x == "evening") | (x == "night"),
        (x == "night") | (x == "morning"),
    ], axis=1).astype(int)

x = np.array([["morning", "afternoon", "evening", "night"]]).T
print(x)
x = two_hot(x)
print(x)


Output:



[['morning']
['afternoon']
['evening']
['night']]
[[1 0 0 1]
[1 1 0 0]
[0 1 1 0]
[0 0 1 1]]


Then we can measure the distances:



from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(x)


Output:



array([[0. , 1.41421356, 2. , 1.41421356],
[1.41421356, 0. , 1.41421356, 2. ],
[2. , 1.41421356, 0. , 1.41421356],
[1.41421356, 2. , 1.41421356, 0. ]])





answered Nov 14 '18 at 7:59 by Tomáš Přinda
  • While chronologically morning should be closer to afternoon than to evening, for example, there may be no reason in the data to assume that is the case. One-hot encoding leaves it to the machine to work out which categories are the most similar. I like the idea behind your two-hot encoding method, but it may be forcing one's own assumptions onto the data.

    – jwil
    Nov 14 '18 at 16:40






  • You are right that it depends on the task. For some tasks it might be better to treat each time of day as entirely distinct. But the statement "One hot encoding leaves it to the machine to calculate which categories are the most similar" is not true for clustering. Clustering builds clusters from the distances between examples, and those distances are computed from the features. So we should design the features so that similar examples have feature vectors that are close together.

    – Tomáš Přinda
    Nov 15 '18 at 6:21











  • Thanks for the two-hot encoder idea :)

    – eaytan
    Nov 17 '18 at 12:48


















This problem is common in machine learning applications. You need to define one category as the base category (it doesn't matter which), then define indicator variables (0 or 1) for each of the other categories. In other words, create 3 new variables called "Morning", "Afternoon", and "Evening", and set to one whichever of them matches each observation. If it's a night observation, leave all three new variables as 0.
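
A minimal sketch of this scheme with pandas (my own illustration, not from the answer; note that get_dummies drops the alphabetically first category rather than letting you pick the base):

```python
import pandas as pd

day_part = pd.Series(["Morning", "Night", "Afternoon", "Evening", "Night"])

# drop_first=True removes one category ("Afternoon" here, the alphabetically
# first), which becomes the implicit base encoded as all zeros
dummies = pd.get_dummies(day_part, drop_first=True).astype(int)
print(dummies)
```

Like plain one-hot encoding, this puts all categories at equal distance from the base, so it fixes the arbitrary ordering but not the cyclic structure.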






answered Nov 13 '18 at 21:12 by jwil