Clustering rows by group based on column value

Problem description

I have the following:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
             Obs = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1))

I want this:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
             Obs = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1),
             Cluster = c(0,1,1,1,2,2,2,3,3,3,0,0,1))

How can I obtain the 'Cluster' column with dplyr? Within each ID, every run of 1s should get the next number in sequence, starting a new number at the first 1 that follows a 0.

Consecutive 0s have to keep the current cluster value until a new run of 1s appears.

Edit

How can I do that with many columns?

Suppose that I have 99 Obs columns and I would like to create 99 cluster columns, one for each Obs column. Like this:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
Obs1 = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1),
Obs2 = c(0,0, 0, 1, 1,1,0, 1, 0, 1, 0,0,1),
ClusterObs1 = c(0,1,1,1,2,2,2,3,3,3,0,0,1),
ClusterObs2 = c(0,0,0,1,1,1,1,2,2,3,0,0,1))


Recommended answer

Here is an option using rle:

library(dplyr)

df %>% 
  group_by(ID) %>% 
  mutate(clust = with(rle(Obs), rep(cumsum(values == 1), lengths)))
# # A tibble: 13 x 4
# # Groups:   ID [2]
# ID   Obs Cluster clust
# <dbl> <dbl>   <dbl> <int>
# 1    1.    0.      0.     0
# 2    1.    1.      1.     1
# 3    1.    1.      1.     1
# 4    1.    0.      1.     1
# 5    1.    1.      2.     2
# 6    1.    0.      2.     2
# 7    1.    0.      2.     2
# 8    1.    1.      3.     3
# 9    1.    1.      3.     3
# 10    1.    1.      3.     3
# 11    2.    0.      0.     0
# 12    2.    0.      0.     0
# 13    2.    1.      1.     1

Here is the main part of it:

rle(df$Obs)
#Run Length Encoding
#  lengths: int [1:8] 1 2 1 1 2 3 2 1
#  values : num [1:8] 0 1 0 1 0 1 0 1

This tells you how long each stretch of 1s or 0s is in the Obs column (ignoring the ID grouping for now).

What we need now is a cumulative count of how many stretches of 1s there have been so far. To get it, we simply take the cumsum of where the values are 1:

with(rle(df$Obs), cumsum(values == 1))
#[1] 0 1 1 2 2 3 3 4

So far so good. Now we need to repeat each of those values as many times as its stretch is long, so we use rep together with the lengths information from rle:

with(rle(df$Obs), rep(cumsum(values == 1), lengths))
# [1] 0 1 1 1 2 2 2 3 3 3 3 3 4
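
As a cross-check on the rle steps above, the same numbering can be sketched by counting the positions where the series switches from 0 to 1. This is an alternative formulation, not part of the original answer, and it assumes the column contains only 0s and 1s with no NA values. The names obs, via_rle and via_diff are illustrative.

```r
# Ungrouped Obs values from the question
obs <- c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1)

# rle-based version, as in the answer
via_rle <- with(rle(obs), rep(cumsum(values == 1), lengths))

# Alternative: prepend a 0 so a leading 1 counts as a switch,
# then count every 0 -> 1 transition cumulatively
via_diff <- cumsum(diff(c(0, obs)) == 1)

all(via_rle == via_diff)
```

Both versions increase the counter exactly when a new run of 1s begins, which is why they agree on this input.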

Finally, we do this per ID group, so the count restarts for each ID.
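
For readers who want to check the grouped result without dplyr, here is a minimal base R sketch of the same idea. The helper name cluster_runs is hypothetical (it does not appear in the original answer), and ave() is used here only as a stand-in for group_by() plus mutate().

```r
# Hypothetical helper: number the runs of 1s in a 0/1 vector
cluster_runs <- function(x) {
  runs <- rle(x)
  rep(cumsum(runs$values == 1), runs$lengths)
}

df <- data.frame(ID  = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 Obs = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1))

# ave() applies the helper separately within each ID,
# so the counter restarts at every new ID
df$Cluster <- ave(df$Obs, df$ID, FUN = cluster_runs)
df$Cluster
```

The resulting Cluster column should match the one requested in the question, with ID 2 starting again from 0.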

If you need to create several cluster columns for different Obs columns, you can do it as follows:

df %>% 
  group_by(ID) %>% 
  # funs() is deprecated since dplyr 0.8.0; a named list with a formula
  # produces the same Obs1_cluster / Obs2_cluster column names
  mutate_at(vars(starts_with("Obs")), 
            list(cluster = ~ with(rle(.), rep(cumsum(values == 1), lengths))))

# # A tibble: 13 x 7
# # Groups:   ID [2]
# ID  Obs1  Obs2 ClusterObs1 ClusterObs2 Obs1_cluster Obs2_cluster
# <dbl> <dbl> <dbl>       <dbl>       <dbl>        <int>        <int>
# 1    1.    0.    0.          0.          0.            0            0
# 2    1.    1.    0.          1.          0.            1            0
# 3    1.    1.    0.          1.          0.            1            0
# 4    1.    0.    1.          1.          1.            1            1
# 5    1.    1.    1.          2.          1.            2            1
# 6    1.    0.    1.          2.          1.            2            1
# 7    1.    0.    0.          2.          1.            2            1
# 8    1.    1.    1.          3.          2.            3            2
# 9    1.    1.    0.          3.          2.            3            2
# 10    1.    1.    1.          3.          3.            3            3
# 11    2.    0.    0.          0.          0.            0            0
# 12    2.    0.    0.          0.          0.            0            0
# 13    2.    1.    1.          1.          1.            1            1

where df is:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2), Obs1 = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1), Obs2 = c(0,0, 0, 1, 1,1,0, 1, 0, 1, 0,0,1), ClusterObs1 = c(0,1,1,1,2,2,2,3,3,3,0,0,1), ClusterObs2 = c(0,0,0,1,1,1,1,2,2,3,0,0,1))
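
With newer dplyr versions the multi-column step can also be written with across(). The sketch below assumes dplyr 1.0.0 or later. The helper name cluster_runs and the result name res are illustrative rather than taken from the original answer, and the input here contains only the Obs columns.

```r
library(dplyr)

df <- data.frame(ID   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 Obs1 = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1),
                 Obs2 = c(0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1))

# Hypothetical helper: number the runs of 1s in a 0/1 vector
cluster_runs <- function(x) with(rle(x), rep(cumsum(values == 1), lengths))

res <- df %>%
  group_by(ID) %>%
  # Apply the helper to every column whose name starts with "Obs";
  # .names controls how the new columns are called
  mutate(across(starts_with("Obs"), cluster_runs,
                .names = "{.col}_cluster")) %>%
  ungroup()

res
```

The same call scales to 99 Obs columns without change, since starts_with("Obs") selects them all.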
