How to cluster users based on tags


Question

I'd like to cluster users based on the categories or tags of the shows they watch. What is the easiest or best algorithm for this?

Assuming I have around 20,000 tags and several million watch events to use as signals, is there an algorithm I can implement with, say, Pig/Hadoop/Mortar, or perhaps on Neo4j?

In terms of data, I have users, the programs they've watched, and the tags on each program (usually around 10 tags per program).

In the end I'd like k clusters (maybe a dozen?), or broad buckets, that I can use to classify and group my users. I'd also like some insight into how the users are divided, with a set of tags representing each cluster.

I've seen some posts suggesting a hierarchical algorithm, but I'm not sure how one would calculate "distance" in that case. Would it be the distance between two users, between a user and a set of tags, or something else?

Recommended answer

You basically want to cluster the users according to their tags.

To keep it simple, assume you have only 10 tags (instead of 20,000). Suppose a user, say user_34, has the 2nd and 7th tags. For this clustering task, user_34 can be represented as a point in 10-dimensional space with coordinates [0,1,0,0,0,0,1,0,0,0].
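As a rough illustration of that encoding, here is a minimal Python sketch. The names `program_tags`, `watched`, `tags_for_user`, and `encode_user` are hypothetical, as is the toy data; the only thing taken from the answer is the idea of a 0/1 coordinate per tag.

```python
NUM_TAGS = 10  # toy size; the real problem would have about 20,000

# Hypothetical toy data: tags (numbered from 1) attached to each program,
# and the programs each user has watched.
program_tags = {"show_a": [2, 7], "show_b": [7]}
watched = {"user_34": ["show_a", "show_b"]}


def tags_for_user(user):
    """Collect every tag seen on any program the user watched."""
    return {tag for program in watched[user] for tag in program_tags[program]}


def encode_user(tag_positions, num_tags=NUM_TAGS):
    """Return a 0/1 list with a 1 at each (1-based) tag position."""
    vector = [0] * num_tags
    for position in tag_positions:
        vector[position - 1] = 1
    return vector


user_34 = encode_user(tags_for_user("user_34"))
print(user_34)  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```

This sketch only records whether a tag was seen at all. Counting how often each tag was watched would be a different design choice that the answer does not discuss.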

In your case, each user can similarly be represented as a point in a 20,000-dimensional space. You can use Apache Mahout, which includes several clustering algorithms, such as K-means.
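To make the K-means idea concrete, here is a small pure-Python sketch of the algorithm on toy data. It is not Mahout's API and would not scale to millions of events; the function names, the four-tag vectors, and the fixed starting centroids are all assumptions chosen so the example is deterministic.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise average of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, until stable."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so we have converged
        assignments = new_assignments
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # leave an empty cluster's centroid where it was
                centroids[i] = mean(members)
    return assignments, centroids


# Hypothetical users over four tags.
users = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
assignments, centroids = kmeans(users, initial_centroids=[users[0], users[2]])
print(assignments)  # [0, 0, 1, 1]
```

Real implementations usually pick starting centroids at random or with a seeding heuristic, so results can differ between runs.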

Since everything is well defined in a coordinate system, computing the distance between any two users is easy. Any distance function will work, but Euclidean distance is the de facto standard.
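The following sketch shows what "distance between two users" looks like in code. `user_35` is a made-up second user, and the Jaccard distance is included only as one example of swapping in a different distance function; the answer itself names only the Euclidean distance.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length tag vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of tags")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(a, b):
    """1 - (shared tags / tags held by either user), for 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0


user_34 = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]  # tags 2 and 7
user_35 = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # hypothetical: tags 2 and 5

# The users differ on two tags (5 and 7), so the distance is sqrt(2).
print(euclidean_distance(user_34, user_35))  # 1.4142135623730951
print(jaccard_distance(user_34, user_35))
```

For 0/1 vectors, the squared Euclidean distance is simply the number of tags on which the two users differ.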

Note: Mahout and many other data-mining programs support formats suited to sparse features, i.e. you do not need to write ...,0,0,0,0,... in the file; you only need to specify which tags are present. (See RandomAccessSparseVector in Mahout.)
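The idea behind a sparse representation can be sketched in Python by storing only the tags a user has. This is an illustration of the concept, not of Mahout's RandomAccessSparseVector; the function names and sample users are hypothetical.

```python
import math


def sparse_euclidean(tags_a, tags_b):
    """Euclidean distance between two 0/1 vectors stored as sets of
    tag indices. Only tags held by exactly one user contribute, and
    each contributes 1 to the squared distance."""
    return math.sqrt(len(tags_a ^ tags_b))


def to_dense(tags, num_tags):
    """Expand a set of 1-based tag indices into a full 0/1 list."""
    return [1 if i in tags else 0 for i in range(1, num_tags + 1)]


user_34 = {2, 7}  # two entries instead of 20,000 coordinates
user_35 = {2, 5}  # hypothetical second user

print(sparse_euclidean(user_34, user_35))  # 1.4142135623730951
print(to_dense(user_34, 10))  # [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
```

With a user touching perhaps a few hundred of the 20,000 tags, the set form keeps both storage and distance computation proportional to the tags actually present.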

Note: I assumed you only want to cluster your users. Extracting representative information from the clusters is somewhat trickier. For example, for each cluster you could select the tags that are most common among its users. Alternatively, you could use concepts from information theory, such as information gain, to find out which tags carry the most information about the cluster.
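Both suggestions can be sketched in a few lines of Python. The helper names (`top_tags`, `entropy`, `information_gain`) and the sample data are assumptions for illustration. Information gain is computed here for one tag against membership in one cluster, which is only one of several reasonable ways to set it up.

```python
import math
from collections import Counter


def top_tags(cluster_users, n=2):
    """Most frequent tags among the users of one cluster."""
    counts = Counter(tag for tags in cluster_users for tag in tags)
    return [tag for tag, _ in counts.most_common(n)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(has_tag, in_cluster):
    """How much knowing 'user has the tag' reduces uncertainty about
    'user is in the cluster'. Both arguments are parallel bool lists."""
    n = len(in_cluster)
    conditional = 0.0
    for value in (True, False):
        subset = [c for t, c in zip(has_tag, in_cluster) if t == value]
        if subset:
            conditional += len(subset) / n * entropy(subset)
    return entropy(in_cluster) - conditional


# Hypothetical cluster of three users and their tags.
cluster = [["drama", "crime"], ["drama", "comedy"], ["drama", "crime"]]
print(top_tags(cluster))  # ['drama', 'crime']

# Four hypothetical users: the first two are in the cluster.
in_cluster = [True, True, False, False]
print(information_gain([True, True, False, False], in_cluster))  # 1.0
print(information_gain([True, False, True, False], in_cluster))  # 0.0
```

A tag that is common everywhere will rank high by raw frequency but low by information gain, which is why the two approaches can pick different representative tags.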
