How to find (not replace) leading NAs, gaps, and final NAs in data.table columns by category
Problem description
I'm trying to get an idea of the types of missing values in my panel dataset. I think there can be three cases:
- leading NAs: before the data starts for a given individual
- gaps: missing data for a couple of time periods, after which the data restarts
- NAs at the end: if an individual stops reporting early
I'm not looking for functions that directly change them or fill them in. Instead, I want to decide what to do with them after I have an idea of the problem.
How to get rid of leading NA's (but not how to see how many you have) is solved here. Addressing all NA's is straightforward:
library(data.table)
Data <- as.data.table(iris)[,.(Species,Petal.Length)]
Data[, time := rep(1951:2000,3)]
Data[c(1:5,60:65,145:150), Petal.Length := NA]
# in Petal.Length, setosa has leading NAs, versicolor a gap, virginica NAs at the end
Data[is.na(Petal.Length)] # this is a mix of all three types of NA's
But I want to differentiate the three cases. Ideally, I'd like to address them directly in data.table as
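- give me a data.table with all observations that are leading NAs in Petal.Length
- give me a data.table with the observations that are gaps in Petal.Length
- give me a data.table with the NAs that fall in the final time periods of each individual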
For lead NA's I can still get it done but it feels super clumsy:
Data[!is.na(Petal.Length), firstobs := ifelse(min(time) == time, 1, 0), by = Species]
Data[, mintime := max(firstobs * time, na.rm = T), by = Species]
Data[time < mintime]
I guess something similar could be done with max and leads for the final NAs, but I can't get my head around gaps, and those are the most important ones for me. The solutions I found online usually fill in, delete, or shift these NAs directly; I just want to have a look at them.
The desired output is:
Leading NAs:
Data[1:5]
Gaps:
Data[60:65]
NAs at the end:
Data[145:150]
But I'd like to get these by checking where the three types of NAs are, as my actual dataset is too large to check manually.
edit: I should add that in my real dataset, I don't know when every individual starts reporting data. So:
Data[is.na(Petal.Length), time, by= Species]
won't help me.
Recommended answer
One way:
Data[, g := {
r = rleid(vna <- is.na(Petal.Length))
if (first(vna)) r = replace(r, r == 1L, 0L)
if ( last(vna)) r = replace(r, r == last(r), 9999L)
replace(r, !vna, NA_integer_)
}, by=Species]
Confirming that it matches the rows expected by the OP...
> # leading
> Data[g == 0L, which = TRUE]
[1] 1 2 3 4 5
> # trailing
> Data[g == 9999L, which = TRUE]
[1] 145 146 147 148 149 150
> # gaps
> Data[!.(c(0L, 9999L, NA_integer_)), on="g", which = TRUE]
[1] 60 61 62 63 64 65
To just take the subset, use these commands without the which = TRUE argument.
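As a minimal sketch of that, assuming the example data from the question and the g column built exactly as above, the three subsets could be pulled out as follows. The names leading, trailing, and gaps are only illustrative, and the gaps filter is a plain logical equivalent of the not-join shown above, not a different method.

```r
library(data.table)

# Rebuild the example data from the question
Data <- as.data.table(iris)[, .(Species, Petal.Length)]
Data[, time := rep(1951:2000, 3)]
Data[c(1:5, 60:65, 145:150), Petal.Length := NA]

# Tag NA runs per Species, as in the answer
Data[, g := {
  r = rleid(vna <- is.na(Petal.Length))
  if (first(vna)) r = replace(r, r == 1L, 0L)
  if ( last(vna)) r = replace(r, r == last(r), 9999L)
  replace(r, !vna, NA_integer_)
}, by = Species]

# Subsets instead of row numbers (NA in g counts as FALSE in i)
leading  <- Data[g == 0L]
trailing <- Data[g == 9999L]
gaps     <- Data[!is.na(g) & !g %in% c(0L, 9999L)]
```

Each result is an ordinary data.table, so it keeps the Species and time columns and can be inspected or summarised further.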
Beyond just identifying the rows in each of the three categories, this approach also identifies gaps via distinct g values if there are multiple.
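To illustrate the point about multiple gaps, here is a small sketch under one loud assumption: a second gap (rows 80:82) is added to versicolor, which is not part of the original example. The gap_summary name and its columns are hypothetical, chosen for this illustration only.

```r
library(data.table)

# Example data from the question, plus an extra gap at rows 80:82 (an assumption)
Data <- as.data.table(iris)[, .(Species, Petal.Length)]
Data[, time := rep(1951:2000, 3)]
Data[c(1:5, 60:65, 80:82, 145:150), Petal.Length := NA]

# Tag NA runs per Species, as in the answer
Data[, g := {
  r = rleid(vna <- is.na(Petal.Length))
  if (first(vna)) r = replace(r, r == 1L, 0L)
  if ( last(vna)) r = replace(r, r == last(r), 9999L)
  replace(r, !vna, NA_integer_)
}, by = Species]

# One row per gap: which individual, where it starts and ends, how many periods
gap_summary <- Data[!is.na(g) & !g %in% c(0L, 9999L),
                    .(from = min(time), to = max(time), n_missing = .N),
                    by = .(Species, g)]
print(gap_summary)
```

Because each gap within a Species gets its own g value, grouping by Species and g keeps the gaps apart without any extra bookkeeping.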
How it works
You can insert some print and cat instructions to follow what each object looks like while each group is processed:
csprintf <- function(s, ...) cat(sprintf(s, ...))
Data[, g := {
csprintf("Group: %s = %s %s\n", toString(names(.BY)), toString(.BY), strrep("*", 60))
r = rleid(vna <- is.na(Petal.Length))
csprintf("NA positions and initial grouping vector:\n")
print(data.table(Petal.Length, r, vna))
if (first(vna)) r = replace(r, r == 1L, 0L)
csprintf("NA positions and grouping vector after tagging leading NAs:\n")
print(data.table(Petal.Length, r, vna))
if ( last(vna)) r = replace(r, r == last(r), 9999L)
csprintf("NA positions and grouping vector after tagging trailing NAs:\n")
print(data.table(Petal.Length, r, vna))
r = replace(r, !vna, NA_integer_)
csprintf("NA positions and grouping vector after tagging non-NAs:\n")
print(data.table(Petal.Length, r, vna))
cat(strrep("\n", 2))
r
}, by=Species]
In short, it creates the vna vector indicating NA positions and the r vector that groups runs in vna. Then it assigns special codes to certain runs (0 for a leading run of NAs, 9999 for a trailing one, NA for non-missing rows), which can later be used for filtering.
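The same steps can be followed on a single toy vector, outside of any grouping. This is only a sketch: the vector x is made up for illustration and stands in for one individual's Petal.Length.

```r
library(data.table)

# Hypothetical series: two leading NAs, one gap, two trailing NAs
x   <- c(NA, NA, 3, 4, NA, 6, NA, NA)
vna <- is.na(x)      # TRUE where the value is missing
r   <- rleid(vna)    # run ids of consecutive equal values: 1 1 2 2 3 4 5 5

if (first(vna)) r <- replace(r, r == 1L, 0L)           # leading NA run  -> 0
if (last(vna))  r <- replace(r, r == last(r), 9999L)   # trailing NA run -> 9999
g <- replace(r, !vna, NA_integer_)                     # non-missing     -> NA

print(g)
```

Any value of g other than 0, 9999, or NA marks a gap; here that is the single NA in position 5, which keeps its run id of 3.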