Clustering rows by group based on column value

Published: 2020/10/26 3:12:18 · Tags: r, dplyr, seq

Question

I have the following:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
             Obs = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1))

I want this:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
             Obs = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1),
             Cluster = c(0,1,1,1,2,2,2,3,3,3,0,0,1))

How can I obtain the 'Cluster' column with dplyr? Within each ID, each run of consecutive 1s in Obs should get the next number in sequence (1, 2, 3, and so on).

Consecutive 0s keep the current cluster value until a new run of 1s appears.

Edit

How can I do this with many columns?

Suppose I have 99 Obs columns and want to create 99 cluster columns, one for each. Like this:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2),
Obs1 = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1),
Obs2 = c(0,0, 0, 1, 1,1,0, 1, 0, 1, 0,0,1),
ClusterObs1 = c(0,1,1,1,2,2,2,3,3,3,0,0,1),
ClusterObs2 = c(0,0,0,1,1,1,1,2,2,3,0,0,1))

Recommended answer

Here is an approach using rle:

library(dplyr)

# Within each ID, number the runs of 1s; 0s keep the count reached so far
df %>% 
  group_by(ID) %>% 
  mutate(clust = with(rle(Obs), rep(cumsum(values == 1), lengths)))
# # A tibble: 13 x 4
# # Groups:   ID [2]
# ID   Obs Cluster clust
# <dbl> <dbl>   <dbl> <int>
# 1    1.    0.      0.     0
# 2    1.    1.      1.     1
# 3    1.    1.      1.     1
# 4    1.    0.      1.     1
# 5    1.    1.      2.     2
# 6    1.    0.      2.     2
# 7    1.    0.      2.     2
# 8    1.    1.      3.     3
# 9    1.    1.      3.     3
# 10    1.    1.      3.     3
# 11    2.    0.      0.     0
# 12    2.    0.      0.     0
# 13    2.    1.      1.     1

The core of it is this:

rle(df$Obs)
#Run Length Encoding
#  lengths: int [1:8] 1 2 1 1 2 3 2 1
#  values : num [1:8] 0 1 0 1 0 1 0 1

This tells you how long each run of 1s or 0s is in the Obs column (ignoring the ID grouping for now).
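To make the run-length encoding concrete, here is a small base R sketch. The vector name obs is only for illustration; it holds the same values as the Obs column, with the ID grouping ignored.

```r
# Same values as the Obs column (ID grouping ignored)
obs <- c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1)

r <- rle(obs)

# Each run is a (value, length) pair
runs <- data.frame(value = r$values, length = r$lengths)
print(runs)

# The run lengths add up to the length of the input
print(sum(r$lengths) == length(obs))

# inverse.rle() rebuilds the original vector from the encoding
print(identical(inverse.rle(r), obs))
```

Because the encoding is lossless, anything computed per run can be expanded back to one value per row, which is what the following steps rely on.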

What we need now is a cumulative count of how many runs of 1s have occurred so far. To get it, we simply apply cumsum to the runs whose value is 1:

with(rle(df$Obs), cumsum(values == 1))
#[1] 0 1 1 2 2 3 3 4

So far so good. Now we need to repeat each of those counts as many times as its run is long, so we use rep with the lengths information from rle:

with(rle(df$Obs), rep(cumsum(values == 1), lengths))
# [1] 0 1 1 1 2 2 2 3 3 3 3 3 4
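The three steps so far can be wrapped in a small function. This is a minimal sketch: run_cluster is a hypothetical helper name, not part of the original answer, and it assumes the input contains no missing values. The last lines cross-check the result against a different formulation (counting the positions where a run of 1s starts).

```r
# Hypothetical helper (not from the original answer): number the runs of 1s
run_cluster <- function(x) {
  r <- rle(x)                       # step 1: encode the runs
  run_id <- cumsum(r$values == 1)   # step 2: count the runs of 1s seen so far
  rep(run_id, r$lengths)            # step 3: expand back to one value per element
}

obs <- c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1)
result <- run_cluster(obs)
print(result)

# Cross-check: a run of 1s starts wherever the value is 1
# and the previous value is not 1
starts <- obs == 1 & c(0, head(obs, -1)) != 1
print(identical(result, cumsum(starts)))
```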

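Finally, we do this per ID group, so the count restarts at 0 for each ID. That is why the last three rows (ID 2) end up as 0 0 1 in the grouped result rather than 3 3 4 as in the ungrouped vector above.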
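For comparison, the same per-group computation can be sketched in base R with ave(), which applies a function to a vector separately within each level of a grouping variable. This is an alternative to the dplyr pipeline, not what the original answer uses; the column name clust mirrors the one used earlier.

```r
df <- data.frame(ID  = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 Obs = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1))

# ave() runs FUN on Obs within each ID and returns the results
# in the original row order
df$clust <- ave(df$Obs, df$ID,
                FUN = function(x) with(rle(x), rep(cumsum(values == 1), lengths)))

print(df$clust)
```

Note that ave() keeps the type of its first argument, so clust comes back as a double here rather than an integer.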

If you need to create several cluster columns for different Obs columns, you can easily do so as follows:

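# funs() is deprecated as of dplyr 0.8.0; a named list of lambdas
# produces the same <column>_cluster names
df %>% 
  group_by(ID) %>% 
  mutate_at(vars(starts_with("Obs")), 
            list(cluster = ~ with(rle(.), rep(cumsum(values == 1), lengths))))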

# # A tibble: 13 x 7
# # Groups:   ID [2]
# ID  Obs1  Obs2 ClusterObs1 ClusterObs2 Obs1_cluster Obs2_cluster
# <dbl> <dbl> <dbl>       <dbl>       <dbl>        <int>        <int>
# 1    1.    0.    0.          0.          0.            0            0
# 2    1.    1.    0.          1.          0.            1            0
# 3    1.    1.    0.          1.          0.            1            0
# 4    1.    0.    1.          1.          1.            1            1
# 5    1.    1.    1.          2.          1.            2            1
# 6    1.    0.    1.          2.          1.            2            1
# 7    1.    0.    0.          2.          1.            2            1
# 8    1.    1.    1.          3.          2.            3            2
# 9    1.    1.    0.          3.          2.            3            2
# 10    1.    1.    1.          3.          3.            3            3
# 11    2.    0.    0.          0.          0.            0            0
# 12    2.    0.    0.          0.          0.            0            0
# 13    2.    1.    1.          1.          1.            1            1

where df is:

df <- data.frame(ID = c(1,1,1,1,1,1,1,1,1,1,2,2,2), Obs1 = c(0,1, 1, 0, 1,0,0, 1, 1, 1, 0,0,1), Obs2 = c(0,0, 0, 1, 1,1,0, 1, 0, 1, 0,0,1), ClusterObs1 = c(0,1,1,1,2,2,2,3,3,3,0,0,1), ClusterObs2 = c(0,0,0,1,1,1,1,2,2,3,0,0,1))
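In newer dplyr versions, the multi-column step can also be written with across(). The following is a sketch that assumes dplyr >= 1.0.0 (where across() was introduced); the variable name res and the "{.col}_cluster" naming pattern are choices made here for illustration, picked to match the column names in the output above.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0, which provides across()

df <- data.frame(ID   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
                 Obs1 = c(0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1),
                 Obs2 = c(0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1))

res <- df %>%
  group_by(ID) %>%
  # Apply the rle-based numbering to every column whose name starts with "Obs"
  mutate(across(starts_with("Obs"),
                ~ with(rle(.x), rep(cumsum(values == 1), lengths)),
                .names = "{.col}_cluster")) %>%
  ungroup()

print(res)
```

The same call scales to 99 Obs columns without changes, since starts_with("Obs") selects them all.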
