How to find (not replace) leading NAs, gaps, and final NAs in data.table columns by category

Question

I'm trying to get an idea of the types of missing values in my panel dataset. I think there can be three cases:


  1. leading NA's; before data starts for a certain individual
  2. gaps; so missing data for a couple of time periods after which data restarts
  3. NA's at the end; of the data if an individual stops early

I'm not looking for functions that directly change them or fill them in. Instead, I want to decide what to do with them after I have an idea of the problem.

How to get rid of leading NA's (but not how to see how many you have) is solved here. Addressing all NA's is straightforward:

library(data.table)
Data <- as.data.table(iris)[,.(Species,Petal.Length)]
Data[, time := rep(1951:2000,3)]
Data[c(1:5,60:65,145:150), Petal.Length := NA]
# in Petal.Length, setosa has leading NAs, versicolor a gap, virginica NAs at the end

Data[is.na(Petal.Length)] # this is a mix of all three types of NA's 

But I want to differentiate the three cases. Ideally, I'd like to address them directly in data.table as


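  1. give me a data.table with all observations that are leading NAs in Petal.Length
  2. give me a data.table with all observations that are gaps in Petal.Length
  3. give me a data.table with all observations where the NAs fall in the final time periods of an individual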

For lead NA's I can still get it done but it feels super clumsy:

Data[!is.na(Petal.Length), firstobs := ifelse(min(time) == time, 1, 0), by = Species]
Data[, mintime := max(firstobs * time, na.rm = T), by = Species]
Data[time < mintime]

I guess something similar could be done with max and leads for the last NA's, but I can't get my head around gaps, and those are the most important ones for me. The solutions I found online usually directly fill in, delete or shift these NA's; I just want to have a look.

The desired output is:

Leading NAs:

Data[1:5]

Gaps:

Data[60:65]

NAs at the end:

Data[145:150]

But I'd like to get these by checking where the three types of NA's are, as my actual dataset is too large to check this manually.

edit: I should add that in my real dataset, I don't know when every individual starts reporting data. So:

Data[is.na(Petal.Length), time, by= Species]

won't help me.

Recommended answer

One approach:

Data[, g := {
  r = rleid(vna <- is.na(Petal.Length))
  if (first(vna)) r = replace(r, r == 1L, 0L)
  if ( last(vna)) r = replace(r, r == last(r), 9999L)
  replace(r, !vna, NA_integer_)
}, by=Species]

Confirming that it matches the rows expected by the OP...

> # leading
> Data[g == 0L, which = TRUE]
[1] 1 2 3 4 5
> # trailing
> Data[g == 9999L, which = TRUE]
[1] 145 146 147 148 149 150
> # gaps
> Data[!.(c(0L, 9999L, NA_integer_)), on="g", which = TRUE]
[1] 60 61 62 63 64 65

To just take the subset, use these commands without the which = TRUE argument.
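For an overview rather than row indices, the g codes can be mapped to a label and summarised per Species. This is only a sketch built on the answer's coding (0 for leading, 9999 for trailing, anything else non-NA for a gap); the na_type column and the na_summary table are hypothetical names introduced here, and fcase assumes data.table 1.13.0 or later.

```r
library(data.table)

# Rebuild the example data from the question
Data <- as.data.table(iris)[, .(Species, Petal.Length)]
Data[, time := rep(1951:2000, 3)]
Data[c(1:5, 60:65, 145:150), Petal.Length := NA]

# Tag NA runs per Species, using the same coding as the answer
Data[, g := {
  r <- rleid(vna <- is.na(Petal.Length))
  if (first(vna)) r <- replace(r, r == 1L, 0L)
  if (last(vna))  r <- replace(r, r == last(r), 9999L)
  replace(r, !vna, NA_integer_)
}, by = Species]

# na_type is a hypothetical helper column, not part of the original answer
Data[, na_type := fcase(
  is.na(g),   NA_character_,  # observed values carry no NA type
  g == 0L,    "leading",
  g == 9999L, "trailing",
  default = "gap"             # any other run id is a gap
)]

# One row per Species and NA type: how many rows, and which time span
na_summary <- Data[!is.na(na_type),
                   .(n = .N, from = min(time), to = max(time)),
                   by = .(Species, na_type)]
print(na_summary)
```

Because the summary reports time spans per group, it does not require knowing in advance when each individual starts reporting data.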

Beyond just identifying the rows in each of the three categories, this approach also identifies gaps via distinct g values if there are multiple.
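To illustrate the point about distinct g values, here is a small sketch on made-up data (the toy table, its id and x columns, and the gaps summary are all hypothetical). One individual has a leading NA, two separate gaps, and a trailing NA; grouping by g then yields one row per gap.

```r
library(data.table)

# Hypothetical series: leading NA, a two-period gap, a one-period gap, trailing NA
toy <- data.table(id = "a", time = 1:10,
                  x = c(NA, 1, NA, NA, 2, 3, NA, 4, 5, NA))

# Same tagging logic as the answer, applied to column x
toy[, g := {
  r <- rleid(vna <- is.na(x))
  if (first(vna)) r <- replace(r, r == 1L, 0L)
  if (last(vna))  r <- replace(r, r == last(r), 9999L)
  replace(r, !vna, NA_integer_)
}, by = id]

# Keep only gap rows (NA runs that are neither leading nor trailing),
# then describe each gap separately via its own g value
gaps <- toy[!is.na(g) & !(g %in% c(0L, 9999L)),
            .(start = min(time), end = max(time), length = .N),
            by = .(id, g)]
print(gaps)
```

The actual g values of gaps (3 and 5 here) are just run ids from rleid, so they are only meaningful as labels within one individual.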

How it works

You can insert some print and cat instructions to follow what each object looks like during the loop:

csprintf <- function(s, ...) cat(sprintf(s, ...))
Data[, g := {
  csprintf("Group: %s = %s %s\n", toString(names(.BY)), toString(.BY), strrep("*", 60))

  r = rleid(vna <- is.na(Petal.Length))
  csprintf("NA positions and initial grouping vector:\n")
  print(data.table(Petal.Length, r, vna))

  if (first(vna)) r = replace(r, r == 1L, 0L)
  csprintf("NA positions and grouping vector after tagging leading NAs:\n")
  print(data.table(Petal.Length, r, vna))

  if ( last(vna)) r = replace(r, r == last(r), 9999L)
  csprintf("NA positions and grouping vector after tagging trailing NAs:\n")
  print(data.table(Petal.Length, r, vna))

  r = replace(r, !vna, NA_integer_)
  csprintf("NA positions and grouping vector after tagging non-NAs:\n")
  print(data.table(Petal.Length, r, vna))

  cat(strrep("\n", 2))

  r
}, by=Species]

Essentially, it creates the vna vector marking NA positions and the r vector that numbers the runs in vna. It then assigns special codes to particular runs (0 for a leading run of NAs, 9999 for a trailing run, NA for observed values), which can later be used for filtering.
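The same steps can be traced on a single plain vector, outside of any by-group call. This is a minimal sketch with a made-up vector x; the intermediate values in the comments follow from running each line in order.

```r
library(data.table)

# Hypothetical input: two leading NAs, one gap, one trailing NA
x <- c(NA, NA, 1, NA, 2, NA)

vna <- is.na(x)   # TRUE TRUE FALSE TRUE FALSE TRUE
r <- rleid(vna)   # run ids: 1 1 2 3 4 5

# A series that starts with NA: its first run becomes code 0
if (first(vna)) r <- replace(r, r == 1L, 0L)          # 0 0 2 3 4 5

# A series that ends with NA: its last run becomes code 9999
if (last(vna)) r <- replace(r, r == last(r), 9999L)   # 0 0 2 3 4 9999

# Observed values get NA so only NA runs keep a code
g <- replace(r, !vna, NA_integer_)                    # 0 0 NA 3 NA 9999
print(g)
```

One consequence of the ordering: if a series consists only of NAs, the single run is first set to 0 and then overwritten with 9999, so it ends up tagged as trailing.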
