How to cluster data with discrete binary attributes?


Question

My data has ten million binary attributes, but only some of them are informative; most of them are zero.

The format looks like this:

data  attribute1 attribute2 attribute3 attribute4   .........
A          0          1           0         1       .........
B          1          0           1         0       .........
C          1          1           0         1       .........
D          1          1           0         0       .........

What is a smart way to cluster this? I know k-means clustering, but I don't think it is suitable in this case: binary values make distances less meaningful, and it will suffer from the curse of dimensionality. Even if I cluster on only the few informative attributes, that is still too many attributes.

I think a decision tree would be a nice way to cluster this data, but it is a classification algorithm!

What should I do?

Recommended answer

Have you considered frequent itemset mining?

K-means is definitely a bad idea, but hierarchical clustering may work when used with an appropriate distance function such as Jaccard, Hamming, or Dice.
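As a rough illustration of hierarchical clustering with a set-based distance, here is a minimal pure-Python sketch using Jaccard distance and average linkage on the four sample rows from the question. Storing each row as the set of indices of its non-zero attributes, the choice of average linkage, and the function names are all assumptions made for this example, not part of the original answer. For real data, a library routine such as SciPy's `pdist` with `metric='jaccard'` followed by `linkage` would probably be a better fit.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two sets of non-zero attribute indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two all-zero rows count as identical
    return 1.0 - len(a & b) / union


def average_linkage(c1, c2, rows):
    """Mean pairwise Jaccard distance between two clusters of row names."""
    total = sum(jaccard_distance(rows[i], rows[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(rows, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[name] for name in rows]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], rows),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Sample rows from the question, stored sparsely as sets of "on" attributes.
rows = {
    "A": {2, 4},
    "B": {1, 3},
    "C": {1, 2, 4},
    "D": {1, 2},
}

clusters = agglomerative(rows, k=2)
print(clusters)
```

This naive loop recomputes every pairwise distance on each merge, so it only suits small samples. The sparse set representation, on the other hand, keeps each distance computation proportional to the number of non-zero attributes rather than to all ten million columns.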

Anyway, what is a cluster? The choice of algorithm needs to fit the kind of cluster you want to find. On binary data, centroid-based methods such as k-means don't make sense, because the centroids are not very meaningful.
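To see why a centroid is hard to interpret here, consider the mean of three of the sample rows from the question. The short sketch below (variable names are illustrative) computes it column by column.

```python
# Rows A, C and D from the question, restricted to the four visible attributes.
members = [
    [0, 1, 0, 1],  # A
    [1, 1, 0, 1],  # C
    [1, 1, 0, 0],  # D
]

# The centroid is the per-attribute mean over the cluster members.
centroid = [sum(column) / len(members) for column in zip(*members)]
print(centroid)
```

Two of the four coordinates come out as fractions (two thirds), so the centroid is not itself a valid binary record. It describes attribute frequencies within the cluster rather than a typical member.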

If the data is "shopping cart" type information, consider using frequent itemset mining, as it can discover overlapping subsets.
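A minimal sketch of frequent itemset mining, assuming an Apriori-style level-wise search and treating each row as the set of its non-zero attributes. The function name, the `min_support` threshold of 2, and the use of attribute indices as items are assumptions for illustration; the answer does not name a specific algorithm.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset in at least min_support rows."""
    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    size = 2
    while current:
        # Build candidates of the next size from pairs of frequent itemsets.
        candidates = {
            a | b for a, b in combinations(list(current), 2) if len(a | b) == size
        }
        # Apriori pruning: every (size - 1)-subset must itself be frequent.
        candidates = {
            c
            for c in candidates
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        }
        # Count the surviving candidates against the data.
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_support:
                current[c] = n
        result.update(current)
        size += 1
    return result


# Rows A, B, C, D from the question as sets of non-zero attribute indices.
transactions = [{2, 4}, {1, 3}, {1, 2, 4}, {1, 2}]
itemsets = frequent_itemsets(transactions, min_support=2)
for itemset, count in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this sample, row C supports both the itemset {1, 2} and the itemset {2, 4}, which illustrates the overlap that a partitioning method would not express.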
