Filling "implied missing values" in a data frame that has varying observations per time unit


Question

I have a large dataset with spatiotemporal data. Each set of coordinates is associated with an id (a player id in a computer game). Unfortunately, the coordinates for each id aren't logged at every time unit. If no reading is available for a specific id at a given time stamp, that row is omitted from the dataset entirely rather than logged as NA.

I would like every time unit to have exactly as many observations as there are unique ids (i.e. insert the "implied missing values" as explicit NAs). At time units where an id is missing, it should be inserted as a new row with NA for its coordinates.

Here's a dummy dataset to illustrate:

time <- c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,14,14,14,14,14,14,15,15,15)
id <- c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5)
x <- c(128,128,64,64,124,128,120,68,64,64,122,71,65,64,112,74,116,114,113,73,70,70,111,75,70)
y <- c(128,128,64,66,125,128,124,66,67,64,124,67,71,68,113,68,115,119,113,76,69,77,116,80,82)

spatiodf <- as.data.frame(cbind(time, id, x, y))


   time id   x   y
1    10  1 128 128
2    10  3 128 128
3    10  4  64  64
4    10  5  64  66
5    11  1 124 125
6    11  2 128 128
7    11  3 120 124
8    11  4  68  66
9    11  5  64  67
10   11  6  64  64
11   12  2 122 124
12   12  4  71  67
13   12  5  65  71
14   12  6  64  68
15   13  3 112 113
16   13  6  74  68
17   14  1 116 115
18   14  2 114 119
19   14  3 113 113
20   14  4  73  76
21   14  5  70  69
22   14  6  70  77
23   15  2 111 116
24   15  4  75  80
25   15  5  70  82
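
A quick way to see the problem is to count the rows per time unit and compare the total with the size of a full id-by-time grid. This is only a sketch in base R using the time and id vectors from the dummy data; the names obs_per_time, n_ids and n_missing are made up for illustration.

```r
# time and id vectors copied from the dummy data above
time <- c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,14,14,14,14,14,14,15,15,15)
id <- c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5)

# number of rows logged at each time unit (varies between time units)
obs_per_time <- table(time)

# a complete grid would have one row per id per time unit
n_ids <- length(unique(id))
n_missing <- n_ids * length(obs_per_time) - length(time)

print(obs_per_time)
print(n_missing)  # rows that are "implied missing"
```

The difference between the full grid size and the actual row count is the number of NA rows that need to be inserted.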

From the above, I would like to get to the output below, where the data frame is rebuilt so that every time unit has the same number of observations (here the NA values were inserted by hand in the rows with missing readings).

time <- rep(10:15, each = 6)
id <- rep(1:6, times = 6)
x <- c(128,NA,128,64,64,NA,124,128,120,68,64,64,NA,122,NA,71,65,64,NA,NA,112,NA,NA,74,116,114,113,73,70,70,NA,111,NA,75,70,NA)
y <- c(128,NA,128,64,66,NA,125,128,124,66,67,64,NA,124,NA,67,71,68,NA,NA,113,NA,NA,68,115,119,113,76,69,77,NA,116,NA,80,82,NA)

spatiodf_equal_obs <- as.data.frame(cbind(time, id, x, y))

library(dplyr)
spatiodf_equal_obs %>% 
  arrange(id)

   time id   x   y
1    10  1 128 128
2    11  1 124 125
3    12  1  NA  NA
4    13  1  NA  NA
5    14  1 116 115
6    15  1  NA  NA
7    10  2  NA  NA
8    11  2 128 128
9    12  2 122 124
10   13  2  NA  NA
11   14  2 114 119
12   15  2 111 116
13   10  3 128 128
14   11  3 120 124
15   12  3  NA  NA
16   13  3 112 113
17   14  3 113 113
18   15  3  NA  NA
19   10  4  64  64
20   11  4  68  66
21   12  4  71  67
22   13  4  NA  NA
23   14  4  73  76
24   15  4  75  80
25   10  5  64  66
26   11  5  64  67
27   12  5  65  71
28   13  5  NA  NA
29   14  5  70  69
30   15  5  70  82
31   10  6  NA  NA
32   11  6  64  64
33   12  6  64  68
34   13  6  74  68
35   14  6  70  77
36   15  6  NA  NA

The data needs to be in this format because I want to fill in the NA values with the nearest available previous or following entry from the same id. Once the data frame looks like the output above, that can be done with fill() from tidyr:

library(tidyr)
res <- spatiodf_equal_obs %>%
  group_by(id) %>%
  fill(x, y, .direction = "down") %>%
  fill(x, y, .direction = "up") 

I've tried many combinations of spread() and gather(), as well as tricks such as creating new data frames to combine with merge(df1, df2, all = TRUE), but I can't figure out how to get from the first data frame to the second one.

The final output should look like this:

   time id   x   y
1    10  1 128 128
2    11  1 124 125
3    12  1 124 125
4    13  1 124 125
5    14  1 116 115
6    15  1 116 115
7    10  2 128 128
8    11  2 128 128
9    12  2 122 124
10   13  2 122 124
11   14  2 114 119
12   15  2 111 116
13   10  3 128 128
14   11  3 120 124
15   12  3 120 124
16   13  3 112 113
17   14  3 113 113
18   15  3 113 113
19   10  4  64  64
20   11  4  68  66
21   12  4  71  67
22   13  4  71  67
23   14  4  73  76
24   15  4  75  80
25   10  5  64  66
26   11  5  64  67
27   12  5  65  71
28   13  5  65  71
29   14  5  70  69
30   15  5  70  82
31   10  6  64  64
32   11  6  64  64
33   12  6  64  68
34   13  6  74  68
35   14  6  70  77
36   15  6  70  77

Recommended answer

To fill in the gaps with values taken from the nearest row, you can do:

library(data.table)
setDT(spatiodf)

resDT = spatiodf[
  CJ(id = id, time = min(time):max(time), unique = TRUE), on=.(id, time), roll="nearest"
]

# verify
fsetequal(data.table(res), resDT) # TRUE
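
For comparison, here is a sketch of a tidyr route that follows the question's own rule (fill down, then up) rather than a rolling join. It is not part of the original answer. It assumes tidyr 1.0.0 or later for .direction = "downup", and the name res_tidy is made up for illustration. Note that roll = "nearest" picks the closest time in either direction, which is not always the same as filling down and then up: for id 1 at time 13, the nearest observation is at time 14, whereas a down-then-up fill carries the values from time 11 forward.

```r
library(dplyr)
library(tidyr)

# dummy data from the question
time <- c(10,10,10,10,11,11,11,11,11,11,12,12,12,12,13,13,14,14,14,14,14,14,15,15,15)
id <- c(1,3,4,5,1,2,3,4,5,6,2,4,5,6,3,6,1,2,3,4,5,6,2,4,5)
x <- c(128,128,64,64,124,128,120,68,64,64,122,71,65,64,112,74,116,114,113,73,70,70,111,75,70)
y <- c(128,128,64,66,125,128,124,66,67,64,124,67,71,68,113,68,115,119,113,76,69,77,116,80,82)
spatiodf <- data.frame(time, id, x, y)

res_tidy <- spatiodf %>%
  # add a row for every observed id/time combination; x and y are NA in new rows
  complete(id, time) %>%
  group_by(id) %>%
  # carry the previous value forward, then fill any leading gap from the next value
  fill(x, y, .direction = "downup") %>%
  ungroup() %>%
  arrange(id, time)

print(res_tidy, n = 36)
```

complete() only uses id and time values that occur somewhere in the data, so a time unit at which no id was logged at all would still be absent from the result.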

How it works

  • setDT converts to a data.table in place, so no <- is needed.

  • DT[i, on=, roll=] uses i to look up rows in DT, rolling each row of i to a row in DT. The "roll" is done on the final column in on= (here, time).

  • CJ(a, b, unique = TRUE) returns every combination of the unique values of a and b, much like expand.grid in base R.
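
The pieces described above can be tried on a toy example. This is a minimal sketch, not the answer's own code. The objects obs, grid, nearest and locf are made up for illustration, and the roll = TRUE variant is added only to contrast with roll = "nearest".

```r
library(data.table)

# toy observations for a single id; times 12, 13 and 15 were not logged
obs <- data.table(id = 1, time = c(10, 11, 14), x = c(128, 124, 116))

# every id/time combination, sorted, with duplicate input values dropped
grid <- CJ(id = obs$id, time = as.numeric(10:15), unique = TRUE)

# look up each grid row in obs; the roll happens on time, the last column in on=
nearest <- obs[grid, on = .(id, time), roll = "nearest"]

# for contrast: roll = TRUE only carries the previous observation forward
locf <- obs[grid, on = .(id, time), roll = TRUE]

# time 13 is one step from 14 but two steps from 11, so "nearest" uses
# x = 116 there, while roll = TRUE keeps x = 124
print(nearest)
print(locf)
```

Which variant is appropriate depends on whether a gap should be filled from the closest reading in time or always from the last reading seen.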
