用dplyr填充缺少的序列值 [英] Fill missing sequence values with dplyr

查看:89
本文介绍了用dplyr填充缺少的序列值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个缺少 SNAP_ID值的数据框。我想根据前一个非缺失值(lag()?)的序列,用浮点值填充缺失值。如果可能的话,我真的很想仅使用dplyr实现此目的。

I have a data frame with missing values for "SNAP_ID". I'd like to fill in the missing values with floating point values based on a sequence from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.

假设:


  1. 第一个或最后一个都不会丢失数据我正在根据数据集的最小值和最大值之间的缺失天数来生成缺失日期

  2. 数据集中可能存在多个空白

当前数据:

                  end SNAP_ID
1 2015-06-26 12:59:00     365
2 2015-06-26 13:59:00     366
3 2015-06-27 00:01:00      NA
4 2015-06-27 23:00:00      NA
5 2015-06-28 00:01:00      NA
6 2015-06-28 23:00:00      NA
7 2015-06-29 09:00:00     367
8 2015-06-29 09:59:00     368

我想要实现的目标:

                  end SNAP_ID
1 2015-06-26 12:59:00     365.0
2 2015-06-26 13:59:00     366.0
3 2015-06-27 00:01:00     366.1
4 2015-06-27 23:00:00     366.2
5 2015-06-28 00:01:00     366.3
6 2015-06-28 23:00:00     366.4
7 2015-06-29 09:00:00     367.0
8 2015-06-29 09:59:00     368.0

作为数据框:

df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260, 
    1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct", 
    "POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end", 
    "SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")

这是我为实现这一目标所做的尝试,但这仅适用于第一个缺失的值:

This was my attempt at achieving this goal, but it only works for the first missing value:

df %>% 
  arrange(end) %>%
  mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID))

                  end SNAP_ID
1 2015-06-26 12:59:00   365.0
2 2015-06-26 13:59:00   366.0
3 2015-06-27 00:01:00   366.1
4 2015-06-27 23:00:00      NA
5 2015-06-28 00:01:00      NA
6 2015-06-28 23:00:00      NA
7 2015-06-29 09:00:00   367.0
8 2015-06-29 09:59:00   368.0

来自@ mathematical.coffee的出色答案如下:

The outstanding answer from @mathematical.coffee below:

df %>% 
  arrange(end) %>%
  group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
  mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>%
  ungroup() %>%
  select(-tmp)


推荐答案

编辑:新版本适用于任意数量的NA运行。
这个人也不需要 zoo

new version works for any number of NA runs. This one doesn't need zoo, either.

首先,请注意 tmp = cumsum(!is.na(SNAP_ID)) SNAP_ID 分组为相同的 tmp 由一个非NA值和一系列NA值组成。

First, notice that tmp=cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs such groups of the same tmp consist of one non-NA value followed by a run of NA values.

然后按此变量分组,然后在第一个变量前加上.1 SNAP_ID填写NA:

Then group by this variable and just add .1 to the first SNAP_ID to fill out the NAs:

df %>% 
  arrange(end) %>%
  group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
  mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1)))

                  end SNAP_ID tmp
1 2015-06-26 12:59:00   365.0   1
2 2015-06-26 13:59:00   366.0   2
3 2015-06-27 00:01:00   366.1   2
4 2015-06-27 23:00:00   366.2   2
5 2015-06-28 00:01:00   366.3   2
6 2015-06-28 23:00:00   366.4   2
7 2015-06-29 09:00:00   367.0   3
8 2015-06-29 09:59:00   368.0   4

然后您可以删除 tmp 列( dd %>%select(-tmp)到结尾)。

Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).

编辑:这是旧版本,对于以后运行 NA s

this is the old version which doesn't work for subsequent runs of NAs.

如果您的目标是用先前的值+ 0.1填充每个NA,则可以使用 zoo na。 locf (使用先前的值填充每个 NA )以及 cumsum(is.na(SNAP_ID))* 0.1 添加额外的0.1。

If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.

library(zoo)
df %>% 
  arrange(end) %>%
  mutate(SNAP_ID=ifelse(is.na(SNAP_ID),
                       na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
                       SNAP_ID))

这篇关于用dplyr填充缺少的序列值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆