sklearn categorical data clustering

I'm using sklearn's agglomerative clustering function. I have mixed data that includes both numeric and nominal columns. My nominal columns have values such as "Morning", "Afternoon", "Evening", "Night". If I convert the nominal data to numeric by assigning integer values like 0, 1, 2, 3, the Euclidean distance between "Night" and "Morning" is calculated as 3, but the distance should be 1.

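import pandas as pd
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("mydata.csv", sep=",", header=0, encoding="utf-8")
X = StandardScaler().fit_transform(X)
print("n_samples: %d, n_features: %d" % X.shape)

# Note: scikit-learn 1.2+ names this parameter `metric` instead of `affinity`.
km = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='average')
km.fit(X)

# Report the number of clusters and the silhouette score of the labelling.
print("k = %d, Silhouette Coefficient: %0.3f" % (km.n_clusters,
      metrics.silhouette_score(X, km.labels_, sample_size=None)))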

Here is my code.

How can I customize the distance function in sklearn or convert my nominal data to numeric?

asked Nov 13 '18 at 20:52

eaytan

444

Can you use the built-in sklearn LabelEncoder?

– G. Anderson
Nov 13 '18 at 21:04

1

You actually want to use OneHotEncoder.

– Andreas Mueller
Nov 14 '18 at 1:15
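
On the "customize the distance function" half of the question, here is a minimal sketch, assuming the nominal column is handled on its own and that the four daytimes form a cycle. The names DAYTIMES and circular_distance_matrix are hypothetical, not part of sklearn. The clustering step uses SciPy's hierarchical clustering as a stand-in; sklearn's AgglomerativeClustering can also take a precomputed distance matrix (with a linkage other than ward).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Assumed chronological order of the categories (hypothetical helper data).
DAYTIMES = ["Morning", "Afternoon", "Evening", "Night"]


def circular_distance_matrix(labels, categories=DAYTIMES):
    """Pairwise distance counted as steps around the daytime cycle."""
    index = {name: i for i, name in enumerate(categories)}
    codes = np.array([index[label] for label in labels])
    diff = np.abs(codes[:, None] - codes[None, :])
    # Take the short way round the cycle: Night -> Morning is 1 step, not 3.
    return np.minimum(diff, len(categories) - diff)


labels = ["Morning", "Afternoon", "Evening", "Night", "Morning", "Night"]
D = circular_distance_matrix(labels)
print(D[0, 3])  # distance between "Morning" and "Night"

# Average-linkage clustering on the precomputed distances.
Z = linkage(squareform(D.astype(float)), method="average")
clusters = fcluster(Z, t=2, criterion="maxclust")
print(clusters)
```

This only covers the nominal column. For mixed data, the numeric columns would still need their own distance, combined with this one in whatever weighting suits the task.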

python scikit-learn cluster-analysis

2 Answers

I think you have three options for converting categorical features to numerical ones:

Use OneHotEncoder. The categorical feature becomes four new columns, and each row contains exactly one 1 with 0 elsewhere. The problem here is that the difference between "morning" and "afternoon" is the same as the difference between "morning" and "evening".

Use OrdinalEncoder. The categorical feature becomes a single column: "morning" maps to 1, "afternoon" to 2, and so on. (By default OrdinalEncoder numbers the categories from 0 in sorted order, so pass the chronological order through its categories parameter.) The difference between "morning" and "afternoon" will be smaller than the difference between "morning" and "evening", which is good, but the difference between "morning" and "night" will be the greatest, which might not be what you want.

Use a transformation that I call two_hot_encoder. It is similar to OneHotEncoder, except that each row contains two 1s. The difference between "morning" and "afternoon" will be the same as the difference between "morning" and "night", and it will be smaller than the difference between "morning" and "evening". I think this is the best solution. Check the code.

Code:

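import numpy as np

def two_hot(x):
    # x is a column vector of daytime labels with shape (n_samples, 1).
    # Each output column is 1 for two neighbouring daytimes, so adjacent
    # daytimes (including night and morning) share one active column.
    return np.concatenate([
        (x == "morning") | (x == "afternoon"),
        (x == "afternoon") | (x == "evening"),
        (x == "evening") | (x == "night"),
        (x == "night") | (x == "morning"),
    ], axis=1).astype(int)

x = np.array([["morning", "afternoon", "evening", "night"]]).T
print(x)
x = two_hot(x)
print(x)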

Output:

[['morning']
 ['afternoon']
 ['evening']
 ['night']]
[[1 0 0 1]
 [1 1 0 0]
 [0 1 1 0]
 [0 0 1 1]]

Then we can measure the distances:

from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(x)

Output:

array([[0.        , 1.41421356, 2.        , 1.41421356],
       [1.41421356, 0.        , 1.41421356, 2.        ],
       [2.        , 1.41421356, 0.        , 1.41421356],
       [1.41421356, 2.        , 1.41421356, 0.        ]])
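
To put the three options side by side, here is a small sketch that measures the distance from "morning" to each of the other daytimes under every encoding. It uses plain NumPy; the helper pairwise_euclidean and the hand-written encoding matrices are illustrative assumptions, with rows in chronological order.

```python
import numpy as np

daytimes = ["morning", "afternoon", "evening", "night"]

# One row per daytime, in the order listed above.
one_hot = np.eye(4)
ordinal = np.arange(4, dtype=float).reshape(-1, 1)
two_hot = np.array([[1, 0, 0, 1],
                    [1, 1, 0, 0],
                    [0, 1, 1, 0],
                    [0, 0, 1, 1]], dtype=float)


def pairwise_euclidean(X):
    """Euclidean distance between every pair of rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


for name, enc in [("one-hot", one_hot), ("ordinal", ordinal), ("two-hot", two_hot)]:
    d = pairwise_euclidean(enc)
    # Distances from "morning" (row 0) to afternoon, evening and night.
    print(name, np.round(d[0, 1:], 3))
```

One-hot puts every other daytime at the same distance (about 1.414), ordinal gives 1, 2, 3 so "night" ends up farthest from "morning", and two-hot gives about 1.414, 2, 1.414, which matches the cyclic picture.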

answered Nov 14 '18 at 7:59

Tomáš Přinda

32327

While chronologically morning should be closer to afternoon than to evening, for example, qualitatively there may be no reason to assume that this holds in the data. One-hot encoding leaves it to the machine to calculate which categories are the most similar. I like the idea behind your two-hot encoding method, but it may be forcing one's own assumptions onto the data.

– jwil
Nov 14 '18 at 16:40

1

You are right that it depends on the task. For some tasks it might be better to treat each daytime as unrelated to the others. But the statement "One hot encoding leaves it to the machine to calculate which categories are the most similar" is not true for clustering. Clustering forms clusters based on the distances between examples, and those distances are computed from the features. So we should design the features so that similar examples have feature vectors a short distance apart.

– Tomáš Přinda
Nov 15 '18 at 6:21

thanks for two hot encoder idea :)

– eaytan
Nov 17 '18 at 12:48


This problem is common in machine learning applications. You need to define one category as the base category (it doesn't matter which), then define indicator variables (0 or 1) for each of the other categories. In other words, create 3 new variables called "Morning", "Afternoon", and "Evening", and assign a 1 to whichever category each observation has. If it's a night observation, leave each of these new variables as 0.
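
A minimal sketch of this base-category coding with pandas, assuming a single column of daytime labels. The sample values and the name obs are made up for illustration.

```python
import pandas as pd

obs = pd.Series(["Morning", "Afternoon", "Evening", "Night", "Morning"],
                name="daytime")

# One indicator column per category, then drop the base category ("Night").
dummies = pd.get_dummies(obs).drop(columns="Night").astype(int)

# Put the remaining columns in chronological order.
dummies = dummies[["Morning", "Afternoon", "Evening"]]
print(dummies)
```

The night observation comes out as all zeros. For distance-based clustering, note that this places the base category at Euclidean distance 1 from every other category, while the other categories are about 1.414 apart from each other, so the choice of base is not entirely neutral there.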

answered Nov 13 '18 at 21:12

jwil

1799
