dplyr grouped cumulative set counting using group_by and rowwise do

Question

I have grouped data, ordered within each group, where each row contains a list of values. Within each group, I'd like to produce a count of the new list values that each row contributes to the running union of the lists in that group.

Here is an example:

require(dplyr)
content <- list(c("A", "B"), c("A", "B", "C"), c("D", "E"), c("A", "B"), c("A", "B"), c("A", "B", "C"))
id <- c("a", "a", "a", "b", "b", "b")
order <- c(5, 7, 3, 1, 9, 4)
testdf <- data.frame(id, order, cbind(content))
testdf
#   id order content
# 1  a     5    A, B
# 2  a     7 A, B, C
# 3  a     3    D, E
# 4  b     1    A, B
# 5  b     9    A, B
# 6  b     4 A, B, C

My desired output (after sorting by order descending within each group) would look like this:

#   id order content cc
# 1  a     7 A, B, C 3
# 2  a     5    A, B 3
# 3  a     3    D, E 5
# 4  b     9    A, B 2
# 5  b     4 A, B, C 3
# 6  b     1    A, B 3

cn (cumulative new) would really be preferable to cc (cumulative count), but the above maps to my attempt below, and cn is easily calculated from cc afterwards. Here is my attempted solution, which doesn't work:

res <- testdf %>% 
  arrange(id, desc(order)) %>% 
  mutate(n=row_number()) %>%
  group_by(id) %>%
  mutate(n1=first(n)) %>%
  rowwise() %>%
  bind_cols(do(.,data.frame(vars=length(unique(unlist(testdf$content[.$n1:.$n])))))) %>%
  data.frame

I actually obtained most of that solution from here: Cumulatively paste (concatenate) values grouped by another variable (thanks akrun). The generated values seem to be correct, but they are not associated with the correct rows of the source data frame:

res
#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    2
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    2
# 6  b     1    A, B 6  4    3

As you can see (looking at the vars column, which is equivalent to cc above), for group 'a' the values 2 and 3 are reversed, and for group 'b' the second 2 and the 3 are reversed.

Actually, I worked out what is wrong above: testdf$content is (obviously) not ordered the same way as the dplyr'd data frame. Originally I had .$content instead of testdf$content, and that produced even odder output. So I tried doing it in two stages:

res <- testdf %>% 
    arrange(id, desc(order)) %>% 
    mutate(n=row_number()) %>%
    group_by(id) %>%
    mutate(n1=first(n))
res <- res %>% 
    rowwise() %>%
    bind_cols(do(.,data.frame(vars=length(unique(unlist(res$content[.$n1:.$n])))))) %>%
    data.frame

This produces what I expected:

#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    3
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    3
# 6  b     1    A, B 6  4    3

So my question now is: is there a better way to refer to the whole dplyr-modified data frame inside do(), so that content is ordered correctly? I think . is just the current row, isn't it? Being able to do so would save me from having to create the ordered data frame separately before the do().

Many thanks,

Tim

Recommended answer

You can use the Reduce function in accumulate mode to build the cumulative set of distinct elements, and then use the lengths function to return the cumulative distinct counts. This avoids the rowwise() operation:

library(dplyr)
testdf %>%
  arrange(desc(order)) %>%
  group_by(id) %>%
  # accumulate = TRUE keeps each intermediate union; lengths() counts its elements
  mutate(cc = lengths(Reduce(function(x, y) unique(c(x, y)), content, accumulate = TRUE))) %>%
  arrange(id)

#Source: local data frame [6 x 4]
#Groups: id [2]

#      id order   content    cc
#  <fctr> <dbl>    <list> <int>
#1      a     7 <chr [3]>     3
#2      a     5 <chr [2]>     3
#3      a     3 <chr [2]>     5
#4      b     9 <chr [2]>     2
#5      b     4 <chr [3]>     3
#6      b     1 <chr [2]>     3
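
To see what the Reduce call is doing, it may help to run it outside dplyr on a single group. The following is a minimal base R sketch, not part of the original answer; the names content_a, running_union and cn are introduced here purely for illustration. It uses group 'a' from the example, already sorted by order descending, and also derives the cn (cumulative new) value mentioned in the question from cc.

```r
# Group "a" from the example, already sorted by order descending (7, 5, 3).
# content_a is a hypothetical name used only for this illustration.
content_a <- list(c("A", "B", "C"), c("A", "B"), c("D", "E"))

# accumulate = TRUE makes Reduce return every intermediate result,
# i.e. the running union of distinct values after each row.
running_union <- Reduce(function(x, y) unique(c(x, y)), content_a, accumulate = TRUE)

# lengths() gives the size of each running union: the cumulative count cc.
cc <- lengths(running_union)
cc
# [1] 3 3 5

# cn (cumulative new): how many previously unseen values each row adds,
# taken as the difference between consecutive cumulative counts.
cn <- diff(c(0L, cc))
cn
# [1] 3 0 2
```

Under the same assumptions, the cn step could presumably be added to the grouped pipeline as a further mutate(cn = diff(c(0L, cc))) after cc is computed, since mutate on grouped data evaluates the expression once per group.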
