KMeans dealing with categorical variables
Problem description
I am writing a MapReduce program for the K-means clustering algorithm on a large data file. Each observation consists of columns that include both categorical and numerical variables. The K-means distance calculation is not suited to categorical variables, so we need to filter out the columns with categorical entries.
My question is: filtering out entries that contain characters is easy, but what if a column contains only numbers yet should be treated as categorical (such as a zip code or an ID)?
Thank you!
Removing all categorical variables is probably not the way to go. Have you tried transforming your data set into a numerical one? There are different methods, but for instance:
Given a categorical variable a (let's say colour) with, say, 3 categories (black, white and blue), you can replace a in your data set with three new binary variables (a_1, a_2, a_3). For a given object, exactly one of these new binary variables should equal one and all the others should be zero. So, if an object has a=black, then a_1=1, a_2=0, a_3=0.
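The replacement described above can be sketched in plain Python. This is only an illustration under assumptions: the function name one_hot and the sample data are made up here, and the list of categories is assumed to be known in advance rather than discovered from the data.

```python
def one_hot(values, categories):
    """Replace each categorical value with a list of 0/1 indicators.

    `categories` fixes the column order, so categories[0] maps to a_1,
    categories[1] to a_2, and so on. (Hypothetical helper, not from the post.)
    """
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one indicator is set per object
        rows.append(row)
    return rows


# Illustrative data: the colour variable "a" for four objects.
colours = ["black", "white", "blue", "black"]
categories = ["black", "white", "blue"]
encoded = one_hot(colours, categories)
```

Here an object with a=black becomes [1, 0, 0]. The helper only looks values up in the category list, so it treats numeric codes the same way as strings, provided the column has been declared categorical beforehand.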
You still need to standardise these new variables. There are different ways... you could just try a_1 = a_1 - mean(a_1), where mean(a_1) is the frequency of that category.
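As a rough sketch of that centring step, the snippet below subtracts each binary column's mean (its category frequency) from the column. The function name centre_columns and the sample rows are assumptions for illustration; on a large file the means would have to be computed in a separate pass first.

```python
def centre_columns(rows):
    """Subtract each column's mean from every entry in that column.

    For a 0/1 indicator column the mean is simply the fraction of
    objects that fall in that category. (Hypothetical helper.)
    """
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[x - m for x, m in zip(row, means)] for row in rows]
    return centred, means


# Illustrative indicator rows (a_1, a_2, a_3) for four objects:
# two black, one white, one blue.
encoded = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
centred, means = centre_columns(encoded)
```

With these four objects the column means are 0.5, 0.25 and 0.25, so a black object is represented as [0.5, -0.25, -0.25] after centring.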