KMeans dealing with categorical variables


Question

I am writing a MapReduce program for the k-means clustering algorithm on a large data file. Each observation consists of columns that include both categorical and numerical variables. For k-means, it is not suitable to include categorical variables in the distance calculation, so we need to filter out the columns with categorical entries.

My question is: filtering out entries that contain characters is easy, but what if a column contains only numbers yet should be treated as categorical (such as a zip code or an ID)?

Thank you!

Answer

Removing all categorical variables is probably not the way to go. Did you try to transform your data set into a numerical data set? There are different methods, but for instance:

Given a categorical variable a (let's say colour) containing (say) 3 categories (black, white and blue), you can replace a in your data set with three new binary variables (a_1, a_2, a_3). For a given object, exactly one of these new binary variables should be equal to one; all others should be zero. So, if an object has a=black, then a_1=1, a_2=0, a_3=0.
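As a rough illustration of that replacement, here is a minimal Python sketch. The names `one_hot`, `encode_row` and `COLOURS`, and the order of the categories, are assumptions made for the example and are not part of the original answer.

```python
# Hypothetical category order: a_1 = black, a_2 = white, a_3 = blue
COLOURS = ["black", "white", "blue"]

def one_hot(value, categories):
    """Return one 0/1 indicator per category for a single value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def encode_row(row, index, categories=COLOURS):
    """Replace the categorical column at `index` with its indicator columns."""
    indicators = one_hot(row[index], categories)
    return row[:index] + indicators + row[index + 1:]

# An observation with two numeric columns and one colour column
print(encode_row([1.5, "black", 3.0], 1))
```

Because the list of categories is passed in explicitly, the same function could in principle be applied to a numeric-looking column such as a zip code, provided that column has been declared categorical up front. That declaration is an assumption of the sketch; the code does not detect it from the data.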

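You still need to standardise these new variables. There are different ways to do this; a simple one is to centre each variable by subtracting its mean, a_1 = a_1 - mean(a_1), where mean(a_1) is simply the frequency of that category in the data.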
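The centring step a_1 = a_1 - mean(a_1) might look like the following in plain Python. This is only a sketch: `centre_columns` is a hypothetical helper, and it assumes the rows fit in memory, whereas in a MapReduce job the column means would presumably have to be computed in a separate pass first.

```python
def centre_columns(rows, indicator_indices):
    """Subtract each indicator column's mean (its category frequency)."""
    n = len(rows)
    # Mean of a 0/1 column = fraction of rows that belong to that category
    means = {j: sum(r[j] for r in rows) / n for j in indicator_indices}
    centred = [
        [x - means[j] if j in means else x for j, x in enumerate(r)]
        for r in rows
    ]
    return centred, means

# Four objects already encoded as (a_1, a_2, a_3)
rows = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
centred, means = centre_columns(rows, indicator_indices=[0, 1, 2])
print(means)
print(centred[0])
```

Columns whose index is not listed in `indicator_indices` are left unchanged, so numeric columns can be standardised separately by whatever method suits them.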
