Can you make dplyr::mutate and dplyr::lag default = its own input value?

Problem description

This is similar to this dplyr lag post, and this dplyr mutate lag post, but neither of those asks this question about defaulting to the input value. I am using dplyr to mutate a new field that's a lagged offset of another field (that I've converted to POSIXct). The goal is, for a given ip, to get some summary statistics on the delta between all the times it shows up on my list. I also have about 12 million rows.

The data look like this (prior to mutation)

ip             hour         snap
192.168.1.2    2017070700    0
192.168.1.2    2017070700   15
192.168.1.4    2017070700    0
192.168.1.4    2017070701   45
192.168.1.4    2017070702   30
192.168.1.7    2017070700   15

'hour' is an integer, but should be a timestamp.

'snap' is one of 4 'snapshot' values that represent 15 minute increments.

Here's the data.frame creation code:

test <- data.frame(
  ip   = c("192.168.1.2", "192.168.1.2", "192.168.1.4",
           "192.168.1.4", "192.168.1.4", "192.168.1.7"),
  hour = c(2017070700, 2017070700, 2017070700,
           2017070701, 2017070702, 2017070700),
  snap = c(0, 15, 0, 45, 30, 15)
)

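There are hundreds and sometimes thousands of timestamps per ip. The code below uses dplyr to

  • a) pad the snap values with a leading 0,
  • b) merge the two integer 'date' fields into a single field,
  • c) convert the merged integer 'date' field into a POSIX date,
  • d) group by ip,
  • e) mutate a new column that lags the timestamp by 1 and, if the value is NA, falls back to the original value (this is the bit that doesn't work), and
  • f) mutate a new column that takes the difference between the current time and the previous time (by ip).

These steps refer to the comments at the end of each line.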

library(dplyr)

timedelta <- test %>% 
  mutate(snap = formatC(snap, width=2, flag="0")) %>%                    # a) 
  mutate(fulldateint = paste(hour, snap, sep="")) %>%                    # b) 
  mutate(fulldate = as.POSIXct(strptime(fulldateint, "%Y%m%d%H%M"))) %>% # c) 
  group_by(ip) %>%                                                       # d) 
  mutate(shifted = dplyr::lag(fulldate, default=fulldate)) %>%           # e) 
  mutate(diff = fulldate-shifted)                                        # f) 

After mutation, the data should look like this:

           ip       hour  snap  fulldateint            fulldate             shifted      diff
       <fctr>      <dbl> <chr>        <chr>              <dttm>              <dttm>    <time>
1 192.168.1.2 2017070700    00 201707070000 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
2 192.168.1.2 2017070700    15 201707070015 2017-07-07 00:15:00 2017-07-07 00:00:00  900 secs
3 192.168.1.4 2017070700    00 201707070000 2017-07-07 00:00:00 2017-07-07 00:00:00    0 secs
4 192.168.1.4 2017070701    45 201707070145 2017-07-07 01:45:00 2017-07-07 00:00:00 6300 secs
5 192.168.1.4 2017070702    30 201707070230 2017-07-07 02:30:00 2017-07-07 01:45:00 2700 secs
6 192.168.1.7 2017070700    15 201707070015 2017-07-07 00:15:00 2017-07-07 00:15:00    0 secs

And if I could get lag to default to its original value, the 'delta-T' would always be 0 when it doesn't have a previous value (which is the desired result).

However, dplyr::lag(fulldate, default=fulldate) throws this error:

Error in mutate_impl(.data, dots) : 
Column `shifted` must be length 2 (the group size) or one, not 3

It does work if I use fulldate[1], but then I lose the group_by(ip) result, which is necessary. Is it possible to make lag reference its own input within dplyr?

Note: I really would prefer an answer using dplyr and not data.table, if possible, since I've been using dplyr as our primary data munging library, but also since I'd like to suggest to Mr. Wickham that he take this under consideration if it truly has no solution in the existing dplyr library.

Recommended answer

In the OP's code ...

...
d) group_by(ip) %>%
e) mutate(shifted = dplyr::lag(fulldate, default=fulldate)) %>%
...

The default= argument should have a length of one. Replacing the OP's code with default = first(fulldate) should work in this case (since the first element won't have a lag and so is where we need to apply the default value).
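As a minimal sketch of that suggestion, the question's pipeline could be rewritten as below. A few details are assumptions that are not in the original post: the time zone is pinned to "UTC" so the result does not depend on the local machine, the separate mutate() calls are merged, and diff is stored as plain seconds via difftime() rather than as a difftime object.

```r
library(dplyr)

# Sample data from the question.
test <- data.frame(
  ip   = c("192.168.1.2", "192.168.1.2", "192.168.1.4",
           "192.168.1.4", "192.168.1.4", "192.168.1.7"),
  hour = c(2017070700, 2017070700, 2017070700,
           2017070701, 2017070702, 2017070700),
  snap = c(0, 15, 0, 45, 30, 15)
)

timedelta <- test %>%
  mutate(
    snap        = formatC(snap, width = 2, flag = "0"),   # a) pad to two digits
    fulldateint = paste0(hour, snap),                     # b) merge the fields
    # c) tz = "UTC" is an assumption added here, not part of the question
    fulldate    = as.POSIXct(strptime(fulldateint, "%Y%m%d%H%M", tz = "UTC"))
  ) %>%
  group_by(ip) %>%                                        # d)
  mutate(
    # e) default must have length one; first() is evaluated per group
    shifted = dplyr::lag(fulldate, default = first(fulldate)),
    # f) seconds since the previous timestamp for the same ip
    diff    = as.numeric(difftime(fulldate, shifted, units = "secs"))
  ) %>%
  ungroup()

timedelta$diff
# 0 900 0 6300 2700 0
```

Because first(fulldate) is evaluated inside each ip group, the first row of every group is compared with itself, which gives the 0 the question asks for.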

Related cases:

  • Similarly, with a "lead", we'd want dplyr::lead(x, default=last(x)).
  • With a lag or lead of more than one step (n greater than 1), default= cannot do it and we'd probably need to switch to if_else or case_when or similar. (I'm not sure about the current tidyverse idiom.)
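
The two related cases can be sketched on a small made-up vector (x below is illustrative, not the question's data). For n greater than 1, one possible approach, offered as an assumption rather than as the established tidyverse idiom, is to lag without a default and fill the leading NAs from the input with coalesce(), or to use if_else() on the position.

```r
library(dplyr)

x <- c(10, 20, 30, 40, 50)  # illustrative values only

# Lead by one step, falling back to the last value at the end.
lead_filled <- dplyr::lead(x, default = last(x))
# 20 30 40 50 50

# Lag by two steps, falling back to each row's own value
# where no lagged value exists.
lag2_filled <- coalesce(dplyr::lag(x, n = 2), x)
# 10 20 10 20 30

# Same result with if_else(), keyed on position instead of on NA.
lag2_ifelse <- if_else(seq_along(x) <= 2, x, dplyr::lag(x, n = 2))
```

The coalesce() version also replaces NAs that were already present in x and then shifted by the lag, so the position-based if_else() version is the safer choice when the input can contain missing values.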
