分组数据中的kmeans聚类 [英] kmeans clustering in grouped data

查看:69
本文介绍了分组数据中的kmeans聚类的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当前,我尝试在分组数据中查找聚类的中心.通过使用样本数据集和问题定义,我可以与每个组一起创建 kmeans 集群.但是,当涉及到给定组群的每个中心时,我都不知道如何获得它们.

好的,直到这里一切都很好.当我要提取每个组中的每个聚类中心

 集群<-组合点数%>%group_by(gr)%&%;%dplyr :: select(x,y)%>%kmeans(3)>团聚具有3个大小为594、150、36的群集的K均值群集群集的意思是:gr x y1 1.166667 6.080832 6.00748852 1.333333 4.055645 0.06541583 1.305556 1.507862 5.2417670 

我们可以看到 gr 号已更改,我不知道这些中心属于哪个组.

前进一步,查看 clust

tidy 格式

 >整齐(集群)x1 x2 x3大小的内部集群1 1.166667 6.080832 6.0074885 594 1095.3047 12 1.333333 4.055645 0.0654158 150 312.4182 23 1.305556 1.507862 5.2417670 36 115.2484 3 

我仍然看不到 gr 2 中心信息.

我希望问题解释得很清楚.如果您有任何遗漏的部分,请告诉我!预先感谢!

解决方案

kmeans 不了解dplyr分组,因此它只是找到三个整体中心,而不是每个组中的中心.此时首选的惯用法是输入数据的列表列,例如

 库(tidyverse)points_and_models<-组合点数%>%ungroup()%>%select(-cluster)%&%;%#清除,删除集群名称,以便数据折叠nest(x,y)%&%;%#将输入数据折叠到列表列中mutate(model = map(data,kmeans,3),#在输入数据的列表列上迭代模型centres = map(model,broom :: tidy))#从模型中提取数据points_and_models#>#小动作:2 x 4#>gr数据模型中心#>< chr>< list>< list>< list>#>1 1< tibble [620×2]>< S3:kmeans>< data.frame [3×5]>#>2 2< tibble [160×2]>< S3:kmeans>< data.frame [3×5]>points_and_models%>%嵌套(中心)#>#小动作:6 x 6#>gr x1 x2大小的内部集群#>< chr>< dbl>< dbl>< int>< dbl>< fct>#>1 1 4.29 5.71 158 441.1#>2 1 3.79 0.121 102 213. 2#>3 1 6.39 6.06 360 534.3#>4 2 5.94 5.88 100 194.1#>5 2 4.01 -0.127 50 97.4 2#>6 2 1.07 4.57 10 15.7 3 

请注意, cluster 列来自模型结果,而不是输入数据.

您也可以使用 do 做同样的事情,例如

  combined_points%>%group_by(gr)%&%;%do(model = kmeans(.[c('x','y')],3))%>%ungroup()%>%group_by(gr)%&%do(map_df(.$ model,broom :: tidy))%>%ungroup() 

但是 do 和按行分组在这一点上已被软弃用,并且代码变得有些混乱,正如您需要显式地 ungroup 所看到的那样这么多.

Currently, I try to find centers of the clusters in grouped data. By using sample data set and problem definitions I am able to create kmeans cluster withing the each group. However when it comes to address each center of the cluster for given groups I don't know how to get them. https://rdrr.io/cran/broom/man/kmeans_tidiers.html

The sample data is taken from (with little modifications for add gr column) Sample data

library(dplyr)
library(broom)
library(ggplot2)

set.seed(2015)

sizes_1 <- c(20, 100, 500)
sizes_2 <- c(10, 50, 100)

centers_1 <- data_frame(x = c(1, 4, 6), 
                        y = c(5, 0, 6), 
                        n = sizes_1,
                        cluster = factor(1:3))
centers_2 <- data_frame(x = c(1, 4, 6), 
                        y = c(5, 0, 6), 
                        n = sizes_2,
                        cluster = factor(1:3))

points1 <- centers_1 %>% 
    group_by(cluster) %>%
    do(data_frame(x = rnorm(.$n, .$x), 
                  y = rnorm(.$n, .$y), 
                  gr="1"))

points2 <- centers_2 %>% 
    group_by(cluster) %>%
    do(data_frame(x = rnorm(.$n, .$x), 
                  y = rnorm(.$n, .$y), 
                  gr="2"))

combined_points <- rbind(points1, points2)

> combined_points
# A tibble: 780 x 4
# Groups:   cluster [3]
   cluster           x        y    gr
    <fctr>       <dbl>    <dbl> <chr>
 1       1  3.66473833 4.285771     1
 2       1  0.51540619 5.565826     1
 3       1  0.11556319 5.592178     1
 4       1  1.60513712 5.360013     1
 5       1  2.18001557 4.955883     1
 6       1  1.53998887 4.530316     1
 7       1 -1.44165622 4.561338     1
 8       1  2.35076259 5.408538     1
 9       1 -0.03060973 4.980363     1
10       1  2.22165205 5.125556     1
# ... with 770 more rows

ggplot(combined_points, aes(x, y)) +
    facet_wrap(~gr) +
    geom_point(aes(color = cluster))

ok I everything is great until here. When I want to extract each cluster center for in each group

clust <- combined_points %>% 
    group_by(gr) %>% 
    dplyr::select(x, y) %>% 
    kmeans(3)

> clust
K-means clustering with 3 clusters of sizes 594, 150, 36

Cluster means:
        gr        x         y
1 1.166667 6.080832 6.0074885
2 1.333333 4.055645 0.0654158
3 1.305556 1.507862 5.2417670

As we can see gr number is changed and I don't know these centers belongs to which group.

as we go one step forward to see tidy format of clust

> tidy(clust)
        x1       x2        x3 size  withinss cluster
1 1.166667 6.080832 6.0074885  594 1095.3047       1
2 1.333333 4.055645 0.0654158  150  312.4182       2
3 1.305556 1.507862 5.2417670   36  115.2484       3

still I can't see the gr 2 center information.

I hope the problem explained very clear. Let me know if you have any missing part! Thanks in advance!

解决方案

kmeans doesn't understand dplyr grouping, so it's just finding three overall centers instead of within each group. The preferred idiom at this point to do this is list columns of the input data, e.g.

library(tidyverse)

points_and_models <- combined_points %>% 
    ungroup() %>% select(-cluster) %>%    # cleanup, remove cluster name so data will collapse
    nest(x, y) %>%     # collapse input data into list column
    mutate(model = map(data, kmeans, 3),    # iterate model over list column of input data
           centers = map(model, broom::tidy))    # extract data from models

points_and_models
#> # A tibble: 2 x 4
#>   gr    data               model        centers             
#>   <chr> <list>             <list>       <list>              
#> 1 1     <tibble [620 × 2]> <S3: kmeans> <data.frame [3 × 5]>
#> 2 2     <tibble [160 × 2]> <S3: kmeans> <data.frame [3 × 5]>

points_and_models %>% unnest(centers)
#> # A tibble: 6 x 6
#>   gr       x1     x2  size withinss cluster
#>   <chr> <dbl>  <dbl> <int>    <dbl> <fct>  
#> 1 1      4.29  5.71    158    441.  1      
#> 2 1      3.79  0.121   102    213.  2      
#> 3 1      6.39  6.06    360    534.  3      
#> 4 2      5.94  5.88    100    194.  1      
#> 5 2      4.01 -0.127    50     97.4 2      
#> 6 2      1.07  4.57     10     15.7 3

Note that the cluster column is from the model results, not the input data.

You can also do the same thing with do, e.g.

combined_points %>% 
    group_by(gr) %>% 
    do(model = kmeans(.[c('x', 'y')], 3)) %>% 
    ungroup() %>% group_by(gr) %>% 
    do(map_df(.$model, broom::tidy)) %>% ungroup()

but do and grouping rowwise are sort of soft-deprecated at this point, and the code gets a little janky, as you can see by the need to explicitly ungroup so much.

这篇关于分组数据中的kmeans聚类的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆