Statistics of cluster membership over several days


Question

Assume I have hourly data for 5 categories over 10 consecutive days, created as:

library(xts)
set.seed(123)
timestamp <- seq(as.POSIXct("2016-10-01"),as.POSIXct("2016-10-10 23:59:59"), by = "hour")
data <- data.frame(cat1 = rnorm(length(timestamp),150,5),
                         cat2 = rnorm(length(timestamp),130,3),
                         cat3 = rnorm(length(timestamp),150,5),
                         cat4 = rnorm(length(timestamp),100,8),
                         cat5 = rnorm(length(timestamp),200,15))
data_obj <- xts(data, timestamp) # create time-series object
head(data_obj,2)

Now, for each day separately, I cluster the categories with a simple k-means to see how they behave with respect to each other:

daywise_data <- split.xts(data_obj,f="days",k=1) # split data day wise
clus_obj <- lapply(daywise_data, function(x){ # clustering day wise
  return (kmeans(t(x), 2))
})

Once the clustering is done, I inspect the cluster assignments with

sapply(clus_obj,function(x) x$cluster) # clustering results

This returns a table of cluster labels with one row per category and one column per day.

On visual inspection, it is clear that cat1 and cat3 always stay in the same cluster. Similarly, cat4 and cat5 are in different clusters on most of the 10 days.

Apart from visual inspection, is there an automatic approach to gather this kind of statistics from such a clustering table?

Note: This is a dummy example. My real data frame contains 80 such categories over 100 consecutive days. An automatic summary like the one above would reduce the effort.

Recommended answer

Pair-counting cluster evaluation measures offer an easy way to tackle this problem.

Rather than looking at object-to-cluster assignments, which are unstable (the cluster labels are arbitrary and can swap between runs), these methods look at whether or not two objects are in the same cluster. Such a co-clustered couple is called a "pair".
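As a minimal sketch of the pair idea, the snippet below turns one vector of cluster labels into a logical co-membership matrix with `outer`. The label vector `cl` is made up for illustration; in the question it would be the `$cluster` element of one day's `kmeans` result.

```r
# Hypothetical cluster labels for five categories on a single day
cl <- c(cat1 = 1, cat2 = 2, cat3 = 1, cat4 = 2, cat5 = 1)

# TRUE where two categories share a cluster
same_cluster <- outer(cl, cl, "==")
print(same_cluster)

# Swapping the label numbers (1 <-> 2) leaves the pairs unchanged,
# which is why pairs are more stable than raw labels
cl_swapped <- 3 - cl
pairs_unchanged <- identical(outer(cl_swapped, cl_swapped, "=="), same_cluster)
print(pairs_unchanged)
```

The matrix is symmetric with a TRUE diagonal, so only the upper (or lower) triangle carries information.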

So you can check whether or not these pairs change much over time.
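One possible way to do this check, sketched under assumptions: build a co-membership matrix per day and average them, which gives the fraction of days on which each pair of categories shares a cluster. To keep the example self-contained it regenerates data like the question's in base R and splits by a plain day index instead of using xts; the names `co_by_day` and `co_freq` are illustrative, not part of the original answer.

```r
set.seed(123)
n_days <- 10
n_obs <- 24 * n_days  # hourly observations
data <- data.frame(cat1 = rnorm(n_obs, 150, 5),
                   cat2 = rnorm(n_obs, 130, 3),
                   cat3 = rnorm(n_obs, 150, 5),
                   cat4 = rnorm(n_obs, 100, 8),
                   cat5 = rnorm(n_obs, 200, 15))
day <- rep(seq_len(n_days), each = 24)  # day index of every hourly row

# One k-means run per day; t() makes the categories the clustered objects
clus <- lapply(split(data, day), function(x) kmeans(t(x), 2)$cluster)

# Co-membership matrix per day, then the share of days each pair co-occurs
co_by_day <- lapply(clus, function(cl) outer(cl, cl, "=="))
co_freq <- Reduce(`+`, co_by_day) / length(co_by_day)
print(round(co_freq, 2))
```

With the question's own objects, the list of label vectors would instead come from `lapply(clus_obj, function(x) x$cluster)`. An entry of 1 means the pair was together every day, and an entry of 0 means never.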

Since k-means is randomly initialized, you may also want to run it several times for every time slice, as different runs may return different clusterings!
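A rough sketch of repeating the clustering on one time slice and counting how often each pair ends up together. The one-day data frame `x` and the run count `n_runs` are assumptions chosen for illustration.

```r
set.seed(42)
# Hypothetical one-day slice: 24 hourly rows, 5 categories
x <- data.frame(cat1 = rnorm(24, 150, 5),
                cat2 = rnorm(24, 130, 3),
                cat3 = rnorm(24, 150, 5),
                cat4 = rnorm(24, 100, 8),
                cat5 = rnorm(24, 200, 15))

n_runs <- 20
# Each column holds the cluster labels from one k-means run
runs <- replicate(n_runs, kmeans(t(x), 2)$cluster)

# Fraction of runs in which each pair of categories shares a cluster
co_runs <- Reduce(`+`, lapply(seq_len(n_runs), function(i) {
  outer(runs[, i], runs[, i], "==")
})) / n_runs
print(round(co_runs, 2))
```

Note that the `nstart` argument of `kmeans` also restarts the algorithm several times, but it returns only the best solution, so it does not expose the run-to-run variation that this loop measures.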

You could then say, for example, that series 1 is in the same cluster as series 2 in 90% of the results, and so on.
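To turn such frequencies into an automatic summary, one option is to list every pair once and filter by a threshold. The small matrix `co_freq` below is invented for illustration, and the 0.9 cutoff simply mirrors the 90% figure mentioned above.

```r
# Hypothetical co-occurrence frequencies (symmetric, diagonal = 1)
cats <- c("cat1", "cat2", "cat3")
co_freq <- matrix(c(1.00, 0.40, 0.95,
                    0.40, 1.00, 0.30,
                    0.95, 0.30, 1.00),
                  nrow = 3, dimnames = list(cats, cats))

# Take each unordered pair once from the upper triangle
idx <- which(upper.tri(co_freq), arr.ind = TRUE)
pair_table <- data.frame(a = rownames(co_freq)[idx[, "row"]],
                         b = colnames(co_freq)[idx[, "col"]],
                         freq = co_freq[idx])

# Pairs that share a cluster in at least 90% of the results
stable <- pair_table[pair_table$freq >= 0.9, ]
print(stable)
```

The same table can be filtered the other way (for instance `freq <= 0.1`) to find categories that are almost never clustered together, which scales to the 80-category case without visual inspection.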
