sklearn categorical data clustering
I'm using sklearn's AgglomerativeClustering. I have mixed data that includes both numeric and nominal columns. My nominal columns contain values such as "Morning", "Afternoon", "Evening", "Night". If I convert the nominal data to numeric by assigning integer values like 0, 1, 2, 3, the Euclidean distance between "Night" and "Morning" comes out as 3, but the distance that should be returned is 1. Here is my code:
import pandas as pd
from sklearn import metrics
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("mydata.csv", sep=",", header=0, encoding="utf-8")
X = StandardScaler().fit_transform(X)
print("n_samples: %d, n_features: %d" % X.shape)

k = 5
# Note: 'affinity' was renamed to 'metric' in sklearn 1.2.
km = AgglomerativeClustering(n_clusters=k, affinity='euclidean', linkage='average')
km.fit(X)
print("k = %d, Silhouette Coefficient: %0.3f"
      % (k, metrics.silhouette_score(X, km.labels_, sample_size=None)))
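To make the distance problem concrete, here is a small demonstration (the 0..3 codes are the illustrative encoding described above): with integer labels, plain Euclidean distance puts "Morning" and "Night" farthest apart, even though the day is cyclic and they are adjacent.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# 0=Morning, 1=Afternoon, 2=Evening, 3=Night (naive integer encoding)
codes = np.array([[0], [1], [2], [3]])
print(euclidean_distances(codes))  # d(Morning, Night) = 3.0, but cyclically it should be 1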
How can I customize the distance function in sklearn or convert my nominal data to numeric?
python scikit-learn cluster-analysis
asked Nov 13 '18 at 20:52 by eaytan
Can you use the built-in sklearn LabelEncoder? – G. Anderson, Nov 13 '18 at 21:04
You actually want to use OneHotEncoder. – Andreas Mueller, Nov 14 '18 at 1:15
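For reference, a minimal sketch of the two encoders suggested in the comments (the sample column is hypothetical):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical nominal column with the four time-of-day values.
times = np.array([["Morning"], ["Night"], ["Afternoon"], ["Evening"]])

le = LabelEncoder()
print(le.fit_transform(times.ravel()))  # integer codes, assigned in alphabetical order

enc = OneHotEncoder(sparse=False)  # on sklearn >= 1.2, use sparse_output=False instead
print(enc.fit_transform(times))    # exactly one 1 per row, the rest 0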
2 Answers
I think you have 3 options for converting categorical features to numerical ones:
- Use OneHotEncoder. You transform the categorical feature into four new columns, exactly one of which is 1 and the rest 0. The problem here is that the difference between "morning" and "afternoon" is the same as the difference between "morning" and "evening".
- Use OrdinalEncoder. You transform the categorical feature into just one column: "morning" to 1, "afternoon" to 2, etc. The difference between "morning" and "afternoon" will be smaller than between "morning" and "evening", which is good, but the difference between "morning" and "night" will be the greatest, which might not be what you want.
- Use a transformation that I call two_hot_encoder. It is similar to OneHotEncoder, except there are exactly two 1s in each row. The difference between "morning" and "afternoon" will be the same as the difference between "morning" and "night", and it will be smaller than the difference between "morning" and "evening". I think this is the best solution. Check the code below.
Code:
import numpy as np

def two_hot(x):
    # Each category switches on the two segments it borders in the daily cycle
    # morning -> afternoon -> evening -> night -> (back to) morning.
    return np.concatenate([
        (x == "morning") | (x == "afternoon"),
        (x == "afternoon") | (x == "evening"),
        (x == "evening") | (x == "night"),
        (x == "night") | (x == "morning"),
    ], axis=1).astype(int)
x = np.array([["morning", "afternoon", "evening", "night"]]).T
print(x)
x = two_hot(x)
print(x)
Output:
[['morning']
['afternoon']
['evening']
['night']]
[[1 0 0 1]
[1 1 0 0]
[0 1 1 0]
[0 0 1 1]]
Then we can measure the distances:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances(x)
Output:
array([[0.        , 1.41421356, 2.        , 1.41421356],
       [1.41421356, 0.        , 1.41421356, 2.        ],
       [2.        , 1.41421356, 0.        , 1.41421356],
       [1.41421356, 2.        , 1.41421356, 0.        ]])

answered Nov 14 '18 at 7:59 by Tomáš Přinda
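As a follow-up sketch (my addition, not part of the original answer): once you have a distance matrix like the one above, AgglomerativeClustering can consume it directly via affinity='precomputed', which addresses the "customize the distance function" part of the question. The sample data is hypothetical and reuses the two_hot function defined above.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical observations of the nominal column.
times = np.array([["morning"], ["night"], ["afternoon"], ["evening"], ["morning"]])
D = euclidean_distances(two_hot(times))  # pairwise distances on the two-hot encoding

# 'precomputed' requires a non-ward linkage; fit() then receives the distance matrix.
# On sklearn >= 1.2 the parameter is named 'metric' instead of 'affinity'.
km = AgglomerativeClustering(n_clusters=2, affinity='precomputed', linkage='average')
labels = km.fit_predict(D)
print(labels)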
While chronologically morning should be closer to afternoon than to evening, for example, qualitatively there may be no reason in the data to assume that is the case. One-hot encoding leaves it to the machine to work out which categories are most similar. I like the idea behind your two-hot encoding method, but it may be forcing one's own assumptions onto the data. – jwil, Nov 14 '18 at 16:40
You are right that it depends on the task. For some tasks it might be better to treat each time of day as entirely distinct. But the statement "one-hot encoding leaves it to the machine to work out which categories are most similar" is not true for clustering: clustering builds clusters from the distances between examples, and those distances are computed from the features. So we should design the features so that similar examples end up with feature vectors that are close together. – Tomáš Přinda, Nov 15 '18 at 6:21
Thanks for the two-hot encoder idea :) – eaytan, Nov 17 '18 at 12:48
This problem is common in machine learning applications. You define one category as the base category (it doesn't matter which), then define indicator variables (0 or 1) for each of the other categories. In other words, create 3 new variables called "Morning", "Afternoon", and "Evening", and set to 1 the one matching each observation's category. If it's a night observation, leave all three new variables at 0.

answered Nov 13 '18 at 21:12 by jwil
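A minimal sketch of this dummy coding with pandas (the DataFrame is hypothetical): drop_first=True drops one category to serve as the base, so its rows come out all zeros. pandas drops the alphabetically first category rather than "Night", but as the answer notes, which one serves as the base doesn't matter.

import pandas as pd

df = pd.DataFrame({"time": ["Morning", "Night", "Afternoon", "Evening"]})
# drop_first=True removes the first category ("Afternoon"), making it the base.
dummies = pd.get_dummies(df["time"], prefix="time", drop_first=True)
print(dummies)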