How to deal with categorical data in K-means clustering method when we have mixed data?
Question
I am using the k-means method to cluster some buildings according to their Energy Consumption, Area (in sqm), and the Climate Zone of their location. Climate Zone is a categorical variable; its values can be A, B, C, or D. It must be transformed into a numerical one, and there are two options: LabelEncoder and get_dummies. When I use each of these, the results are totally different. I would like to ask which method is more correct to use.
I guess that because get_dummies creates an extra dimension for each category, it gives more decision power to the categorical variable, which is usually not desirable. On the other hand, using LabelEncoder does not seem entirely right either, because we could assign "A=1, B=2, C=3, D=4" or "A=3, B=2, C=4, D=1" or many other mappings. This may change the results even though the mappings should be interchangeable. So I am not sure which one is better to use.
Any statistical or mathematical explanation is appreciated.
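The difference between the two encodings can be made concrete with distances. This is a minimal sketch (the zone values are taken from the question; everything else is for illustration) showing that LabelEncoder imposes an arbitrary ordering on the zones, while get_dummies makes every pair of distinct zones equidistant:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

zones = pd.Series(["A", "B", "C", "D"])

# LabelEncoder: imposes an arbitrary order, so |A - D| = 3 but |A - B| = 1,
# even though climate zones have no natural ordering.
labels = LabelEncoder().fit_transform(zones)
print(labels)  # [0 1 2 3]

# get_dummies (one-hot): every pair of distinct zones is equidistant
# (Euclidean distance sqrt(2)), which is usually what an unordered
# categorical variable calls for.
dummies = pd.get_dummies(zones).to_numpy(dtype=float)
d_ad = np.linalg.norm(dummies[0] - dummies[3])  # distance A <-> D
d_ab = np.linalg.norm(dummies[0] - dummies[1])  # distance A <-> B
print(d_ad, d_ab)  # both sqrt(2)
```

Because k-means clusters by Euclidean distance, the label encoding silently claims that zone A is "closer" to B than to D, which is exactly the artifact the question worries about.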
Thanks.
What do I mean by get_dummies?
Answer
I'm going to add another answer here. I think my first answer was pretty much correct, but I did figure out a way to use K-means to cluster text, so I will share it here; I am looking for feedback regarding the 'correctness' of this technique.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

# Turn each document into a TF-IDF vector, ignoring English stop words.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

# For each cluster, list the terms with the largest centroid weights.
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])

print("\n")
print("Prediction")

Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)
Results:
Top terms per cluster:
Cluster 0:
eating
kitty
little
came
restaurant
play
ve
feedback
face
extension
Cluster 1:
translate
app
incredible
google
eating
impressed
feedback
face
extension
ve
Cluster 2:
climbing
ninja
cat
eating
impressed
google
feedback
face
extension
ve
Cluster 3:
kitten
belly
squooshy
merley
best
eating
google
feedback
face
extension
Cluster 4:
100
open
tab
smiley
face
google
feedback
extension
eating
climbing
Cluster 5:
chrome
extension
promoter
key
google
eating
impressed
feedback
face
ve
Cluster 6:
impressed
map
feedback
google
ve
eating
face
extension
climbing
key
Cluster 7:
ve
taken
photo
best
cat
eating
google
feedback
face
extension
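Coming back to the original question's mixed building data, a common recipe is to one-hot encode the categorical Climate Zone with get_dummies and standardize the numeric columns so that no single feature dominates the Euclidean distance. The sketch below assumes invented column names and values purely for illustration; it is not from the answer above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical building data: energy consumption, area, and climate zone.
buildings = pd.DataFrame({
    "energy_kwh": [120.0, 300.0, 150.0, 310.0, 90.0, 280.0],
    "area_sqm":   [80.0, 200.0, 95.0, 210.0, 60.0, 190.0],
    "climate":    ["A", "B", "A", "B", "C", "D"],
})

# Standardize the numeric features so energy and area are on comparable
# scales, and one-hot encode the unordered categorical feature.
numeric = StandardScaler().fit_transform(buildings[["energy_kwh", "area_sqm"]])
dummies = pd.get_dummies(buildings["climate"], prefix="zone", dtype=float)
X = pd.concat(
    [pd.DataFrame(numeric, columns=["energy_z", "area_z"]), dummies],
    axis=1,
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)
```

With this setup each one-hot column contributes at most 1 to any squared distance, so the climate zone influences the clustering without overwhelming the standardized numeric features. If the categorical variable still dominates, its dummy columns can be down-weighted by multiplying them by a factor below 1.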