R Univariate Clustering by Group

Problem description

I am trying to find a method to cluster univariate data by group. For example, in the data below I have two failure codes (a and b) and 6 data points for each grouping. In the plot you can see that for each failure code there are 2 distinct clusters for failure time. Manually this isn't bad, but I can't figure out how to do this with a larger data set (~100K rows and ~30 codes). I would like for the end result to give me the medoid for each cluster and the count of codes in that cluster.

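library(ggplot2)

# Sample data: two failure codes, six time-to-failure (ttf) values each
failure <- rep(c("a", "b"), each = 6)
ttf <- c(1, 1.5, 2, 5, 5.5, 6, 8, 8.5, 9, 14, 14.5, 15)
data <- data.frame(failure, ttf)

# Plot ttf by failure code; each code shows two distinct clusters
ggplot(data, aes(failure, ttf)) + geom_point()

# Desired medoids per failure code, picked by hand
results <- data.frame(failure = c("a", "b"), m1 = c(1.5, 8.5), m2 = c(5.5, 14.5))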

I would like for the end result to give me something like the table below.

failure m1   m1count  m2    m2count
a       1.5  3        5.5   3
b       8.5  3        14.5  3


Recommended answer

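This will do what you want, assuming only two clusters per failure group. You can change the number of clusters in the tapply call, and the change then applies to every failure group. Note that kmeans reports cluster means as centers rather than medoids, and the order of the clusters can vary between runs.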

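# Run k-means with 2 centers on ttf within each failure code
res2 <- tapply(data$ttf, INDEX = data$failure, function(x) kmeans(x, 2))
# Collect each group's cluster centers and sizes into a data frame
res3 <- lapply(names(res2), function(x) data.frame(failure = x, Centers = res2[[x]]$centers, Size = res2[[x]]$size))
# Stack the per-group data frames into one result
res3 <- do.call(rbind, res3)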

res3
   failure Centers Size
1        a     5.5    3
2        a     1.5    3
11       b    14.5    3
21       b     8.5    3
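
Since the question asks for medoids and a one-row-per-code table, here is a minimal sketch of one way to get there. It swaps k-means for PAM (partitioning around medoids) from the cluster package, which is not what the answer above uses. The helper name medoid_summary and its k argument are hypothetical, made up for illustration, and the sketch still assumes the same number of clusters for every failure code.

```r
library(cluster)  # provides pam(); a recommended package in standard R installs

# Sample data from the question
failure <- rep(c("a", "b"), each = 6)
ttf <- c(1, 1.5, 2, 5, 5.5, 6, 8, 8.5, 9, 14, 14.5, 15)
data <- data.frame(failure, ttf)

# Hypothetical helper: cluster one group's values into k clusters with PAM
# and return medoids and cluster sizes as a one-row, wide data frame.
medoid_summary <- function(x, k = 2) {
  fit <- pam(matrix(x, ncol = 1), k)
  medoids <- x[fit$id.med]                       # medoids are actual observations
  sizes <- tabulate(fit$clustering, nbins = k)   # points assigned to each cluster
  ord <- order(medoids)                          # so that m1 < m2 < ...
  out <- as.list(c(rbind(medoids[ord], sizes[ord])))  # interleave medoid, count
  names(out) <- paste0("m", rep(seq_len(k), each = 2), c("", "count"))
  as.data.frame(out)
}

# Apply the helper to each failure code and stack the rows
groups <- split(data$ttf, data$failure)
rows <- lapply(groups, medoid_summary)
wide <- data.frame(failure = names(groups), do.call(rbind, rows), row.names = NULL)
wide
```

For the sample data this should give medoids 1.5 and 5.5 for code a and 8.5 and 14.5 for code b, with three points in each cluster. PAM works from pairwise dissimilarities, so very large groups may need cluster::clara or a k-means step instead; that trade-off is not covered by the original answer.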
