Using the `rle` function with the `dplyr` `group_by` command to map a grouping variable


Problem description

I have a data frame with three columns, similar to the one given below. I want to extract the information search pattern based on the values in column `a`.

With help from a few developers (@thelatemail and @David T), I was able to identify the pattern with the `rle` function (see the earlier post, "using rle function to identify pattern"). Now I want to go a step further and add grouping information to the extracted pattern. I tried the `dplyr` `do` function, as in the code below, but it does not work.

The example data and the desired output are given below for reference.

## my code, which produces an error and needs to be fixed
test <- data%>%
  group_by(b, c)%>%
  do(.,  data.frame(from = rle(.$a)$values), to = lead(rle(.$a)$values))



##code to create the data frame
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c,b,a)



## desired output

  c   b          from  to    fromCount toCount
                 <chr> <chr>     <int>   <int>
1 A01 experiment a     b             1       3
2 A02 experiment a     c             1       1
3 A02 experiment c     a             1       1
4 A02 experiment a     b             1       1
5 A03 control    d     e             3       1
6 A04 control    f     e             2       2

Compared with the earlier post, the information gets compressed, because the grouping is applied when the runs in column `a` are extracted.

Recommended answer

We could use `rleid` from `data.table`:

library(data.table)
library(dplyr)
data %>% 
  group_by(b, c, grp = rleid(a)) %>%
  summarise(from = first(a), fromCount = n()) %>% 
  mutate(to = lead(from), toCount = lead(fromCount)) %>%
  ungroup %>%
  select(-grp) %>% 
  filter(!is.na(to)) %>%
  arrange(c)
# A tibble: 6 x 6
#  b          c     from  fromCount to    toCount
#  <chr>      <chr> <chr>     <int> <chr>   <int>
#1 experiment A01   a             1 b           3
#2 experiment A02   a             1 c           1
#3 experiment A02   c             1 a           1
#4 experiment A02   a             1 b           1
#5 control    A03   d             3 e           1
#6 control    A04   f             2 e           2
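
To see why grouping on `rleid(a)` works, it may help to look at the run ids on their own. The following is a minimal sketch using the question's column `a`; the names `run_id` and `base_id` are illustrative and not part of the original answer. Note that inside `group_by()` the id is computed over the whole column before grouping, and the combination with `b` and `c` keeps runs from different groups apart.

```r
library(data.table)

# Column a from the question's example data
a <- c("a", "b", "b", "b", "a", "c", "a", "b",
       "d", "d", "d", "e", "f", "f", "e", "e")

# rleid() gives every run of identical consecutive values its own id
run_id <- rleid(a)
print(run_id)

# The same ids built from base R's rle(), for comparison
r <- rle(a)
base_id <- rep(seq_along(r$lengths), r$lengths)
print(identical(run_id, base_id))
```

Each distinct id then becomes one row after `summarise`, with `n()` giving the run length.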






Alternatively, use `rle` directly: after grouping by `b` and `c`, call `summarise` with `rle` to create a list column, then extract the `values` and `lengths` from that column within the same `summarise`. Create `to` and `toCount` as the `lead` of `from` and `fromCount`, filter out the `NA` rows, and arrange the rows by column `c`.

data %>% 
    group_by(b, c) %>%
    summarise(rl = list(rle(a)), 
              from = rl[[1]]$values, 
              fromCount = rl[[1]]$lengths) %>% 
    mutate(to = lead(from), 
           toCount = lead(fromCount)) %>%
    ungroup %>% 
    select(-rl) %>% 
    filter(!is.na(to)) %>% 
    arrange(c)
# A tibble: 6 x 6
#  b          c     from  fromCount to    toCount
#  <chr>      <chr> <chr>     <int> <chr>   <int>
#1 experiment A01   a             1 b           3
#2 experiment A02   a             1 c           1
#3 experiment A02   c             1 a           1
#4 experiment A02   a             1 b           1
#5 control    A03   d             3 e           1
#6 control    A04   f             2 e           2
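
As a small illustration of what happens inside one group, the sketch below applies `rle` and `lead` to the four values of `a` that belong to group `experiment` / `A01`. The names `grp_a` and `steps` are hypothetical and only used here.

```r
library(dplyr)

# Values of column a for the group experiment / A01
grp_a <- c("a", "b", "b", "b")

# rle() returns the run values and their lengths
r <- rle(grp_a)

# lead() shifts each vector up by one position, so the last run gets NA
steps <- data.frame(
  from      = r$values,
  fromCount = r$lengths,
  to        = lead(r$values),
  toCount   = lead(r$lengths)
)
print(steps)
```

The last run of a group has no successor, so its `to` is `NA`; this is the row that `filter(!is.na(to))` removes.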

We could also loop over the `rle` list column (`rl`) with `map`, extract the components and take the `lead` of the values and lengths in a `tibble`, use `unnest_wider` to create the columns and `unnest` to expand the list structure, then filter out the `NA` elements and arrange:

library(tidyr)
library(purrr)
data %>% 
     group_by(b, c) %>%
     summarise(rl = list(rle(a))) %>%
     ungroup %>%
     mutate(out = map(rl, 
          ~ tibble(from = .x$values,
                   fromCount = .x$lengths,
                   to = lead(from), 
                   toCount = lead(fromCount)))) %>%
     unnest_wider(c(out)) %>% 
     unnest(from:toCount) %>%
     filter(!is.na(to)) %>% 
     arrange(c) %>% 
     select(-rl)
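
For comparison, the same idea can be sketched in base R without any packages. This is not part of the original answer; the helper `run_transitions` and the objects `pieces` and `result` are hypothetical names, and the sketch assumes the example data from the question.

```r
# Example data from the question
a <- c("a", "b", "b", "b", "a", "c", "a", "b",
       "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4),
       rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c, b, a)

# Hypothetical helper: one row per transition between consecutive runs of x
run_transitions <- function(x) {
  r <- rle(as.character(x))
  idx <- seq_len(max(length(r$values) - 1, 0))
  data.frame(from = r$values[idx], fromCount = r$lengths[idx],
             to = r$values[idx + 1], toCount = r$lengths[idx + 1],
             stringsAsFactors = FALSE)
}

# Split by both grouping columns, compute the transitions per group,
# and attach the group labels to every transition row
pieces <- lapply(split(data, list(data$b, data$c), drop = TRUE), function(d) {
  tr <- run_transitions(d$a)
  data.frame(b = rep(as.character(d$b[1]), nrow(tr)),
             c = rep(as.character(d$c[1]), nrow(tr)),
             tr, stringsAsFactors = FALSE)
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL
print(result)
```

Because the helper only pairs each run with the run that follows it, no `NA` rows are created and no filtering step is needed.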
