分组数据中的kmeans聚类 [英] kmeans clustering in grouped data
问题描述
当前,我尝试在分组数据中查找聚类的中心.通过使用样本数据集和问题定义,我可以与每个组一起创建 kmeans
集群.但是,当涉及到给定组群的每个中心时,我都不知道如何获得它们.
好的,直到这里一切都很好.当我要提取每个组中的每个聚类中心
集群<-组合点数%>%group_by(gr)%&%;%dplyr :: select(x,y)%>%kmeans(3)>团聚具有3个大小为594、150、36的群集的K均值群集群集的意思是:gr x y1 1.166667 6.080832 6.00748852 1.333333 4.055645 0.06541583 1.305556 1.507862 5.2417670
我们可以看到 gr
号已更改,我不知道这些中心属于哪个组.
前进一步,查看 clust
tidy
格式 >整齐(集群)x1 x2 x3大小的内部集群1 1.166667 6.080832 6.0074885 594 1095.3047 12 1.333333 4.055645 0.0654158 150 312.4182 23 1.305556 1.507862 5.2417670 36 115.2484 3
我仍然看不到 gr 2
中心信息.
我希望问题解释得很清楚.如果您有任何遗漏的部分,请告诉我!预先感谢!
kmeans
不了解dplyr分组,因此它只是找到三个整体中心,而不是每个组中的中心.此时首选的惯用法是输入数据的列表列,例如
库(tidyverse)points_and_models<-组合点数%>%ungroup()%>%select(-cluster)%&%;%#清除,删除集群名称,以便数据折叠nest(x,y)%&%;%#将输入数据折叠到列表列中mutate(model = map(data,kmeans,3),#在输入数据的列表列上迭代模型centres = map(model,broom :: tidy))#从模型中提取数据points_and_models#>#小动作:2 x 4#>gr数据模型中心#>< chr>< list>< list>< list>#>1 1< tibble [620×2]>< S3:kmeans>< data.frame [3×5]>#>2 2< tibble [160×2]>< S3:kmeans>< data.frame [3×5]>points_and_models%>%嵌套(中心)#>#小动作:6 x 6#>gr x1 x2大小的内部集群#>< chr>< dbl>< dbl>< int>< dbl>< fct>#>1 1 4.29 5.71 158 441.1#>2 1 3.79 0.121 102 213. 2#>3 1 6.39 6.06 360 534.3#>4 2 5.94 5.88 100 194.1#>5 2 4.01 -0.127 50 97.4 2#>6 2 1.07 4.57 10 15.7 3
请注意, cluster
列来自模型结果,而不是输入数据.
您也可以使用 do
做同样的事情,例如
combined_points%>%group_by(gr)%&%;%do(model = kmeans(.[c('x','y')],3))%>%ungroup()%>%group_by(gr)%&%do(map_df(.$ model,broom :: tidy))%>%ungroup()
但是 do
和按行分组在这一点上已被软弃用,并且代码变得有些混乱,正如您需要显式地 ungroup
所看到的那样这么多.
Currently, I try to find centers of the clusters in grouped data. By using sample data set and problem definitions I am able to create kmeans
cluster withing the each group. However when it comes to address each center of the cluster for given groups I don't know how to get them. https://rdrr.io/cran/broom/man/kmeans_tidiers.html
The sample data is taken from (with little modifications for add gr
column)
Sample data
library(dplyr)
library(broom)
library(ggplot2)
set.seed(2015)
sizes_1 <- c(20, 100, 500)
sizes_2 <- c(10, 50, 100)
centers_1 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_1,
cluster = factor(1:3))
centers_2 <- data_frame(x = c(1, 4, 6),
y = c(5, 0, 6),
n = sizes_2,
cluster = factor(1:3))
points1 <- centers_1 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="1"))
points2 <- centers_2 %>%
group_by(cluster) %>%
do(data_frame(x = rnorm(.$n, .$x),
y = rnorm(.$n, .$y),
gr="2"))
combined_points <- rbind(points1, points2)
> combined_points
# A tibble: 780 x 4
# Groups: cluster [3]
cluster x y gr
<fctr> <dbl> <dbl> <chr>
1 1 3.66473833 4.285771 1
2 1 0.51540619 5.565826 1
3 1 0.11556319 5.592178 1
4 1 1.60513712 5.360013 1
5 1 2.18001557 4.955883 1
6 1 1.53998887 4.530316 1
7 1 -1.44165622 4.561338 1
8 1 2.35076259 5.408538 1
9 1 -0.03060973 4.980363 1
10 1 2.22165205 5.125556 1
# ... with 770 more rows
ggplot(combined_points, aes(x, y)) +
facet_wrap(~gr) +
geom_point(aes(color = cluster))
ok I everything is great until here. When I want to extract each cluster center for in each group
clust <- combined_points %>%
group_by(gr) %>%
dplyr::select(x, y) %>%
kmeans(3)
> clust
K-means clustering with 3 clusters of sizes 594, 150, 36
Cluster means:
gr x y
1 1.166667 6.080832 6.0074885
2 1.333333 4.055645 0.0654158
3 1.305556 1.507862 5.2417670
As we can see gr
number is changed and I don't know these centers belongs to which group.
as we go one step forward to see tidy
format of clust
> tidy(clust)
x1 x2 x3 size withinss cluster
1 1.166667 6.080832 6.0074885 594 1095.3047 1
2 1.333333 4.055645 0.0654158 150 312.4182 2
3 1.305556 1.507862 5.2417670 36 115.2484 3
still I can't see the gr 2
center information.
I hope the problem explained very clear. Let me know if you have any missing part! Thanks in advance!
kmeans
doesn't understand dplyr grouping, so it's just finding three overall centers instead of within each group. The preferred idiom at this point to do this is list columns of the input data, e.g.
library(tidyverse)
points_and_models <- combined_points %>%
ungroup() %>% select(-cluster) %>% # cleanup, remove cluster name so data will collapse
nest(x, y) %>% # collapse input data into list column
mutate(model = map(data, kmeans, 3), # iterate model over list column of input data
centers = map(model, broom::tidy)) # extract data from models
points_and_models
#> # A tibble: 2 x 4
#> gr data model centers
#> <chr> <list> <list> <list>
#> 1 1 <tibble [620 × 2]> <S3: kmeans> <data.frame [3 × 5]>
#> 2 2 <tibble [160 × 2]> <S3: kmeans> <data.frame [3 × 5]>
points_and_models %>% unnest(centers)
#> # A tibble: 6 x 6
#> gr x1 x2 size withinss cluster
#> <chr> <dbl> <dbl> <int> <dbl> <fct>
#> 1 1 4.29 5.71 158 441. 1
#> 2 1 3.79 0.121 102 213. 2
#> 3 1 6.39 6.06 360 534. 3
#> 4 2 5.94 5.88 100 194. 1
#> 5 2 4.01 -0.127 50 97.4 2
#> 6 2 1.07 4.57 10 15.7 3
Note that the cluster
column is from the model results, not the input data.
You can also do the same thing with do
, e.g.
combined_points %>%
group_by(gr) %>%
do(model = kmeans(.[c('x', 'y')], 3)) %>%
ungroup() %>% group_by(gr) %>%
do(map_df(.$model, broom::tidy)) %>% ungroup()
but do
and grouping rowwise are sort of soft-deprecated at this point, and the code gets a little janky, as you can see by the need to explicitly ungroup
so much.
这篇关于分组数据中的kmeans聚类的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!