Use dplyr to summarize but preserve date of group row

Question

I have a data frame like the following:

          Date Flare Painmed_Use
1   2015-12-01     0           0
2   2015-12-02     0           0
3   2015-12-03     0           0
4   2015-12-04     0           0
5   2015-12-05     0           0
6   2015-12-06     0           1
7   2015-12-07     1           4
8   2015-12-08     1           3
9   2015-12-09     1           1
10  2015-12-10     1           0
11  2015-12-11     0           0
12  2015-12-12     0           0
13  2015-12-13     1           2
14  2015-12-14     1           3
15  2015-12-15     1           1
16  2015-12-16     0           0
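To run the code in this thread, the sample data can be rebuilt by hand. This is only a sketch: it assumes Date is stored as a Date vector and the other two columns as plain numerics, since the post does not say how the data was read in.

```r
# Rebuild the 16-row sample shown above (values typed in by hand)
df <- data.frame(
  # Assumption: Date is of class Date, one row per consecutive day
  Date = seq(as.Date("2015-12-01"), by = "day", length.out = 16),
  Flare = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0),
  Painmed_Use = c(0, 0, 0, 0, 0, 1, 4, 3, 1, 0, 0, 0, 2, 3, 1, 0)
)

# Inspect column types and the first few values
str(df)
```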

I'm trying to find the length of each flare, as well as the total pain medication use during each flare, using dplyr. My current solution (inspired by the question "Use rle to group by runs when using dplyr") is:

df %>% 
    group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}, Flare) %>%
    summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n())

which gives the following output:

     yy Flare Painmed_UseCum FlareLength
   <int> <int>          <dbl>       <int>
 1     1     0              1           6
 2     2     1              8           4
 3     3     0              0           2
 4     4     1              6           3
 5     5     0              0           1
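The grouping expression in that solution works by turning run lengths into one id per row. A minimal base R sketch of just that step, with the Flare column typed in by hand from the sample data:

```r
# Flare column from the sample data, entered by hand
Flare <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0)

# rle() finds consecutive runs of equal values
runs <- rle(Flare)
runs$lengths  # 6 4 2 3 1
runs$values   # 0 1 0 1 0

# Repeat each run's index by its length to get one id per row
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
```

Rows that share a run id belong to the same flare (or the same flare-free stretch), which is why grouping by it yields one summary row per run.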

This is almost exactly what I need. However, I can't figure out how to preserve the other columns, the critical one being the date that corresponds to the last row of each flare. So the output I'm looking for is the same as above, but with the Date column added, like so:

           Date      yy Flare Painmed_UseCum FlareLength
                  <int> <int>          <dbl>       <int>
 1   2015-12-06       1     0              1           6
 2   2015-12-10       2     1              8           4
 3   2015-12-12       3     0              0           2
 4   2015-12-15       4     1              6           3
 5   2015-12-16       5     0              0           1

Note: In some ways this is a follow-up to a previous question of mine (R code to get max count of time series data by group). I tried to keep that question simpler, which may make it more useful to others, but doing so ended up necessitating this further question.

Recommended answer

You can compute the date inside summarize, for example by taking max(Date) within each group:

library(dplyr)

df %>% 
  group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths),yy$lengths)}) %>%
  summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n(), Date = max(Date))

# Result: one row per run of Flare, with the latest date in each run
#     yy Painmed_UseCum FlareLength Date
#1     1              1           6 2015-12-06
#2     2              8           4 2015-12-10
#3     3              0           2 2015-12-12
#4     4              6           3 2015-12-15
#5     5              0           1 2015-12-16
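If the Flare column should also appear in the result, as in the desired output in the question, one option is to keep it in group_by. The sketch below does that and uses dplyr's last() instead of max(), which picks the date of the last row in each run and therefore assumes the rows are already in date order. The helper run_id is a hypothetical name introduced here, the data is rebuilt by hand, and the .groups argument needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample data rebuilt by hand (assumes Date is of class Date)
df <- data.frame(
  Date = seq(as.Date("2015-12-01"), by = "day", length.out = 16),
  Flare = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0),
  Painmed_Use = c(0, 0, 0, 0, 0, 1, 4, 3, 1, 0, 0, 0, 2, 3, 1, 0)
)

# Hypothetical helper: one id per consecutive run of equal values
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

result <- df %>%
  group_by(yy = run_id(Flare), Flare) %>%   # Flare is constant within a run
  summarize(
    Date = last(Date),                      # date of the last row in the run
    Painmed_UseCum = sum(Painmed_Use),
    FlareLength = n(),
    .groups = "drop"
  )

result
```

Because Flare never changes within a run, adding it to group_by does not split any group; it only carries the column through to the summary.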

Or, if there are more columns to preserve, a better approach is to use mutate and then select the last row in each group.

df %>% 
  group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}) %>%
  mutate(Painmed_UseCum = sum(Painmed_Use),FlareLength = n()) %>%
  slice(n())
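To illustrate that the mutate approach carries every column through, here is a sketch with an extra column. The column Note and the helper run_id are hypothetical names added for the example, and the data is rebuilt by hand.

```r
library(dplyr)

# Sample data rebuilt by hand; Note is a hypothetical extra column
df <- data.frame(
  Date = seq(as.Date("2015-12-01"), by = "day", length.out = 16),
  Flare = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0),
  Painmed_Use = c(0, 0, 0, 0, 0, 1, 4, 3, 1, 0, 0, 0, 2, 3, 1, 0),
  Note = paste0("day", 1:16)
)

# Hypothetical helper: one id per consecutive run of equal values
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

last_rows <- df %>%
  group_by(yy = run_id(Flare)) %>%
  mutate(Painmed_UseCum = sum(Painmed_Use), FlareLength = n()) %>%
  slice(n()) %>%   # keep only the last row of each run
  ungroup()

last_rows
```

Note that the original Painmed_Use column in the result holds the value from the last day of each run, not the total; the per-run total is in Painmed_UseCum.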

To create the groups, we can also replace the rle expression with rleid from data.table, which is simpler.

group_by(yy = data.table::rleid(Flare))
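As a quick check that the two ways of building run ids agree, the sketch below compares them on the Flare column from the sample data (typed in by hand). It assumes the data.table package is installed.

```r
# Flare column from the sample data, entered by hand
Flare <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0)

# Run ids built with base R's rle()
r <- rle(Flare)
ids_rle <- rep(seq_along(r$lengths), r$lengths)

# Run ids from data.table (the package must be installed)
ids_rleid <- data.table::rleid(Flare)

# Both vectors should assign the same id to every row
all(ids_rle == ids_rleid)
```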
