Consistent Cluster Order with Kmeans in R

Problem description

This might not be possible, but Google has failed me so far so I'm hoping someone else might have some insight. Sorry if this has been asked before.

The background is that I have a database of information on different cities: name, population, pollution, crime, and so on, by year. I query it to aggregate the data on a per-city basis and output the result to a table. That works fine.

The next step is running the kmeans() function in R on the data set to find clusters. In testing, I've found via the "elbow method" that 5 clusters is almost always a good choice.
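
For readers unfamiliar with the elbow method, here is a minimal sketch of how it is commonly done with kmeans(). The city_data frame, its column names, and its values are made up for illustration, and the range of k is an arbitrary choice; this is not the asker's actual code.

```r
# Hypothetical stand-in for the aggregated city table (all values are made up)
set.seed(42)
city_data <- data.frame(
  population = c(rnorm(20, 50, 5), rnorm(20, 500, 20), rnorm(20, 2000, 50)),
  crime      = c(rnorm(20, 80, 5), rnorm(20, 30, 5),  rnorm(20, 55, 5))
)

# Put all metrics on a comparable scale before clustering
scaled <- scale(city_data)

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
  kmeans(scaled, centers = k, nstart = 10)$tot.withinss
})

print(round(wss, 2))
```

Plotting wss against k, for example with plot(1:8, wss, type = "b"), shows the "elbow": the k after which adding more clusters stops reducing the within-cluster sum of squares by much.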

The issue I'm having is that these clusters have distinct meanings/interpretations, so I want to tag each row in the original data set with the cluster's interpretation for that row, not the cluster number. So I don't want to identify row 2 with "cluster 5", I want to say "low population, high crime, low income".

If R would output the clusters in the same order, say having cluster 5 always equate to the cluster of cities with "low population, high crime, low income", that would work fine, but it doesn't. For instance, if you run code like this:

a <- kmeans(city_date, centers = 5)
b <- kmeans(city_date, centers = 5)
c <- kmeans(city_date, centers = 5)

and then look at the centers:

a$centers
b$centers
c$centers

The clusters will all contain the same data points, but the cluster numbers will differ. So a mapping table in SQL from cluster number to interpretation won't work: one day the "low population, high crime, low income" cluster might be 5, the next day 2, the next 4, and so on.

What I'm trying to figure out is if there is a way to keep the output consistent. The data set gets updated so it won't even be the same every time, and since R doesn't keep the cluster order consistent even with the same data set, I am wondering if it will be possible at all.
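
For the same-data case only, one partial remedy can be sketched as follows. kmeans() chooses its starting centers at random, so fixing the random seed with set.seed() before each call makes repeated runs on identical data return identical numbering. The data frame and the seed values below are made up for illustration. This does not help once the data set is updated, because the same seed then leads to different starting points.

```r
# Hypothetical stand-in for the city table (values are made up)
set.seed(123)
city_data <- data.frame(
  population = c(rnorm(20, 50, 5), rnorm(20, 500, 20)),
  crime      = c(rnorm(20, 80, 5), rnorm(20, 30, 5))
)

set.seed(2024)                      # fix the RNG state before the first run
a <- kmeans(scale(city_data), centers = 2)

set.seed(2024)                      # same seed, so the same starting centers
b <- kmeans(scale(city_data), centers = 2)

# Compare the cluster labels of the two runs
print(identical(a$cluster, b$cluster))
```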

Thanks for any help anyone can provide. My current idea is to output the $centers data to a SQL table, order the table by each metric in turn, tag the clusters with the highest and lowest values for that metric, and then concatenate the tags into a label for each cluster. This may work but isn't very elegant.
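
The same idea can be sketched directly in R without the SQL round trip. This is an illustration under assumptions, not the asker's implementation: the data is made up, describe() is a hypothetical helper, and each metric is tagged simply as "high" or "low" relative to the overall mean (which is 0 after scaling). Because the label is derived from the center itself, it does not depend on which number kmeans() assigns to the cluster.

```r
# Hypothetical stand-in for the aggregated city table (values are made up)
set.seed(1)
city_data <- data.frame(
  population = c(rnorm(15, 50, 5), rnorm(15, 50, 5), rnorm(15, 900, 30)),
  crime      = c(rnorm(15, 80, 4), rnorm(15, 20, 4), rnorm(15, 70, 4))
)
scaled <- scale(city_data)
fit <- kmeans(scaled, centers = 3, nstart = 25)

# Hypothetical helper: describe one center as "high"/"low" per metric,
# relative to the overall mean (0 on the scaled data)
describe <- function(center) {
  paste(ifelse(center > 0, "high", "low"), names(center), collapse = ", ")
}

# One text label per cluster, then one per row of the original data
labels <- apply(fit$centers, 1, describe)
city_data$segment <- unname(labels[fit$cluster])

print(table(city_data$segment))
```

A caveat of this simple scheme is that two clusters can end up with the same label when their centers fall on the same side of the mean for every metric; finer tags (for example "low"/"mid"/"high") would reduce that risk.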

Recommended answer

I know this is a very old post, but I only came across it now. I had the same problem today and adapted the suggestion by Barker to come up with a solution:

# create a random data frame with a single observation column
df <- data.frame(id = 1:10, obs = sample(0:500, 10))

# run kmeans a first time to get the centers
centers <- kmeans(df$obs, centers = 3)$centers

# sort the centers in ascending order (sort() flattens the 3 x 1 matrix to a vector)
centers <- sort(centers)

# run kmeans again, this time passing the sorted centers as the initial centers
clusteridx <- kmeans(df$obs, centers = centers)$cluster

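Not very elegant, but it works. The clusteridx vector now numbers the clusters by ascending center: cluster 1 has the smallest center and cluster 3 the largest. Note that sort() is only meaningful here because obs is a single column.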

This can also be collapsed into just one line if you prefer:

clusteridx <- kmeans(df$obs, centers = sort(kmeans(df$obs, centers = 3)$centers))$cluster
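
For data with more than one column, sorting the centers matrix as a whole no longer makes sense. A minimal sketch of one alternative, assuming you pick a single column to rank the clusters by (population here, a hypothetical choice), relabels an existing fit with order() and match() instead of running kmeans() twice. The matrix x and its values are made up for illustration.

```r
# Made-up two-column data with three well-separated groups
set.seed(7)
x <- cbind(
  population = c(rnorm(10, 10), rnorm(10, 50), rnorm(10, 100)),
  crime      = c(rnorm(10, 5),  rnorm(10, 1),  rnorm(10, 3))
)
fit <- kmeans(x, centers = 3, nstart = 25)

# ord[i] is the original label of the cluster with the i-th smallest
# population center
ord <- order(fit$centers[, "population"])

# relabel[old] gives the new label for the cluster originally numbered `old`
relabel <- match(seq_along(ord), ord)

# Cluster numbers and centers in the new, stable order
ordered_cluster <- relabel[fit$cluster]
ordered_centers <- fit$centers[ord, , drop = FALSE]
rownames(ordered_centers) <- seq_along(ord)

print(ordered_centers)
```

With this relabeling, cluster 1 is always the one with the lowest population center, whatever numbering kmeans() happened to produce. The numbering is only as stable as the ranking column: if two centers are close on that column, their order can still swap between runs or after a data update.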
