How to deal with categorical data in K-means clustering when we have mixed data?


Problem description


I am using the k-means method to cluster some buildings according to their energy consumption, area (in sqm), and the climate zone of their location. Climate zone is a categorical variable whose values can be A, B, C, or D, so it has to be transformed into a numerical one. There are two options: LabelEncoder and get_dummies. When I use each of those, the results are totally different. I would like to ask which method is more correct to use?


I guess that because get_dummies creates an extra dimension for each category, it gives more decision power to the categorical variable, which is not usually favorable. On the other hand, using LabelEncoder does not seem totally right either: we could just as well say "A=1, B=2, C=3, D=4" or "A=3, B=2, C=4, D=1" or many other options, and this may change the results even though the labelings are interchangeable. So I am not sure which one is better to use.


Any statistical or mathematical explanation is appreciated.
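To make the two options concrete, here is a minimal sketch of both encodings feeding into k-means. The column names and data values are illustrative, not from the original dataset; the point is that LabelEncoder imposes an arbitrary order on the zones (so A and D end up three times as far apart as A and B), while get_dummies makes every pair of distinct zones equidistant at the cost of extra dimensions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans

# Hypothetical buildings: energy consumption, area (sqm), climate zone.
df = pd.DataFrame({
    "energy": [120.0, 85.5, 310.2, 95.1, 210.7, 150.3],
    "area":   [100, 80, 250, 90, 180, 130],
    "zone":   ["A", "B", "C", "D", "A", "C"],
})

# Option 1: LabelEncoder -- maps A,B,C,D to 0,1,2,3, imposing an
# arbitrary order, so d(A, D) = 3 * d(A, B) in that coordinate.
le = LabelEncoder()
X_label = df[["energy", "area"]].copy()
X_label["zone"] = le.fit_transform(df["zone"])

# Option 2: get_dummies (one-hot) -- one indicator column per zone;
# all distinct zone pairs are equally far apart, but the categorical
# variable now occupies up to 4 of the feature dimensions.
X_dummy = pd.get_dummies(df, columns=["zone"])

# Either way, scale the features so energy/area magnitudes do not
# dominate the Euclidean distances.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(StandardScaler().fit_transform(X_dummy))
print(labels)
```

Scaling matters here: without it, the raw energy and area columns would swamp whichever encoding of the climate zone you choose.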

Thanks


What do I mean by get_dummies?
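The original question apparently illustrated this with an image; a minimal reproduction of what pd.get_dummies produces for a categorical column (the column name `zone` is assumed):

```python
import pandas as pd

zones = pd.DataFrame({"zone": ["A", "B", "C", "D"]})
# One indicator column per category: zone_A, zone_B, zone_C, zone_D,
# with exactly one "hot" entry per row.
print(pd.get_dummies(zones, columns=["zone"]))
```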

Answer


I'm going to add another answer here. I think my first answer was pretty much correct. However, I did figure out a way to use K-means to cluster text, so I will share that here, as I am looking for feedback regarding the 'correctness' of this technique.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["This little kitty came to play when I was eating at a restaurant.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tab in google you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with google map feedback.",
             "Key promoter extension for Google Chrome."]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

true_k = 8  # one cluster per document in this toy example
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=1000, n_init=1)
model.fit(X)

print("Top terms per cluster:")
# Sort each centroid's term weights in descending order.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

print("\n")
print("Prediction")

Y = vectorizer.transform(["chrome browser to open."])
prediction = model.predict(Y)
print(prediction)

Y = vectorizer.transform(["My cat is hungry."])
prediction = model.predict(Y)
print(prediction)

Results:

Top terms per cluster:
Cluster 0:
 eating
 kitty
 little
 came
 restaurant
 play
 ve
 feedback
 face
 extension
Cluster 1:
 translate
 app
 incredible
 google
 eating
 impressed
 feedback
 face
 extension
 ve
Cluster 2:
 climbing
 ninja
 cat
 eating
 impressed
 google
 feedback
 face
 extension
 ve
Cluster 3:
 kitten
 belly
 squooshy
 merley
 best
 eating
 google
 feedback
 face
 extension
Cluster 4:
 100
 open
 tab
 smiley
 face
 google
 feedback
 extension
 eating
 climbing
Cluster 5:
 chrome
 extension
 promoter
 key
 google
 eating
 impressed
 feedback
 face
 ve
Cluster 6:
 impressed
 map
 feedback
 google
 ve
 eating
 face
 extension
 climbing
 key
Cluster 7:
 ve
 taken
 photo
 best
 cat
 eating
 google
 feedback
 face
 extension

