如何从R data.table中删除NA块 [英] How should I drop blocks of NAs from an R data.table

查看:363
本文介绍了如何从R data.table中删除NA块的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个大的R data.table 与多列键,其中一些值列包含一些NA。我想删除一个或多个值列中完全为NA的组,但保留整个组。



为了给出一个简化的例子:

  library(data.table)
DT = data.table(
Series = rep(letters [1:12],each = 3),
Id =
Value1 = c(1:3,NA,5:9,rep(NA,3),1:3,NA,5:9,rep(NA,3),1:3,NA, 9,rep(NA,3)),
Value2 = c(rep(NA,3),1:4,NA,6: rep(NA,3)))
DT
系列Id值1 Value2
1:a 1 1 NA
2:a 2 2 NA
3:a 3 3 NA
4:b 4 NA 1
5:b 5 5 2
6:b 6 6 3
7:c 7 7 4
8:c 8 8 NA
9:c 9 9 6
10:d 10 NA 7
11:d 11 NA 8
12:d 12 NA 9
13:e 1 1 NA
14:e 2 2 NA
15:e 3 3 NA
16:f 4 NA 1
17:f 5 5 2
18:f 6 6 3
19:g 7 7 4
20:g 8 8 5
21:g 9 9 6
22:h 10 NA 7
23:h 11 NA 8
24:h 12 NA 9
25:i 1 1 1
26:i 2 2 2
27:i 3 3 3
28:j 4 NA 4
29:j 5 5 5
30:j 6 6 6
31:k 7 7 7
32:k 8 8 8
33:k 9 9 9
34:l 10 NA NA
35:l 11 NA NA
36:l 12 NA NA
系列Id值1 Value2



所以我想放弃:




  • ,d,e,h和l

  • ID:4,10,11和12



正确的结果应如下所示:

 系列ID值1 Value2 
1:b 5 5 2
2 :b 6 6 3
3:c 7 7 4
4:c 8 8 NA
5:c 9 9 6
6:f 5 5 2
7 :f 6 6 3
8:g 7 7 4
9:g 8 8 5
10:g 9 9 6
11:i 1 1 1
12 :i 2 2 2
13:i 3 3 3
14:j 5 5 5
15:j 6 6 6
16:k 7 7 7
17 :k 8 8 8
18:k 9 9 9
系列Id值1 Value2



< hr>

到目前为止我管理的是什么:



我可以找到Value1的NA系列, >

  DT [,sum(1-is.na(Value1))== 0,by = Series] [V1 == TRUE] 

我甚至可以做

  setkey(DT,Series)
DT = DT [DT [,sum(1-is.na(Value))== 0,by = Series] [V1!= TRUE]

但现在我最终以V1出现在决赛桌。

解决方案

您可以这样做,以获取那些条目,其中不是所有 Value 都是 NA

  setkey(DT,Series)
DT [,.SD [ (!all(is.na(Value)))],by = Series]

需要使用!all 来避免Matthew将查看的非联接语法(见注释)。与此相同:

  DT [,.SD [as.logical(!all(is.na(Value))) ,by = Series] 






新的澄清问题:

  allNA = function(x)all(is.na(x))#define辅助函数
for(i in c(Id,Series))
DT = DT [,if(!any(sapply(.SD,allNA))).SD else NULL,by = i]
DT
系列Id值1值2
1:i 1 1 1
2:i 2 2 2
3:i 3 3 3
4:b 5 5 2
5:b 6 6 3
6:f 5 5 2
7:f 6 6 3
8:j 5 5 5
9:j 6 6 6
10:c 7 7 4
11:c 8 8 NA
12:c 9 9 6
13:g 7 7 4
14:g 8 8 5
15:g 9 9 6
16:k 7 7 7
17:k 8 8 8
18:k 9 9 9

这会改变顺序。所以不是正是要求的结果。

 #从原来的DT开始再次出现问题
DT [ ,drop:= FALSE]
for(i in c(Series,Id))
DT [,drop:= drop | any(sapply(.SD,allNA) i]
DT [(!drop)] [,drop:= NULL] []
系列Id值1 Value2
1:b 5 5 2
2:b 6 6 3
3:c 7 7 4
4:c 8 8 NA
5:c 9 9 6
6:f 5 5 2
7:f 6 6 3
8:g 7 7 4
9:g 8 8 5
10:g 9 9 6
11:i 1 1 1
12:i 2 2 2
13:i 3 3 3
14:j 5 5 5
15:j 6 6 6
16:k 7 7 7
17:k 8 8 8
18:k 9 9 9


I have a large R data.table with a multi column key, where some value columns contain some NA. I'd like to remove groups that are entirely NA in one or more value columns, but keep the whole group otherwise. Repeating this for each column of the key.

To give a simplified example:

library(data.table)
DT = data.table(
    Series = rep(letters[1:12], each = 3), 
    Id = 1:12,
    Value1 = c(1:3, NA, 5:9, rep(NA,3), 1:3, NA, 5:9, rep(NA,3), 1:3, NA, 5:9, rep(NA,3)), 
    Value2 = c(rep(NA,3), 1:4, NA, 6:9, rep(NA,3), 1:9, 1:9, rep(NA,3)))
DT
    Series Id Value1 Value2
 1:      a  1      1     NA
 2:      a  2      2     NA
 3:      a  3      3     NA
 4:      b  4     NA      1
 5:      b  5      5      2
 6:      b  6      6      3
 7:      c  7      7      4
 8:      c  8      8     NA
 9:      c  9      9      6
10:      d 10     NA      7
11:      d 11     NA      8
12:      d 12     NA      9
13:      e  1      1     NA
14:      e  2      2     NA
15:      e  3      3     NA
16:      f  4     NA      1
17:      f  5      5      2
18:      f  6      6      3
19:      g  7      7      4
20:      g  8      8      5
21:      g  9      9      6
22:      h 10     NA      7
23:      h 11     NA      8
24:      h 12     NA      9
25:      i  1      1      1
26:      i  2      2      2
27:      i  3      3      3
28:      j  4     NA      4
29:      j  5      5      5
30:      j  6      6      6
31:      k  7      7      7
32:      k  8      8      8
33:      k  9      9      9
34:      l 10     NA     NA
35:      l 11     NA     NA
36:      l 12     NA     NA
    Series Id Value1 Value2

So I would like to drop:

  • Series: a,d,e,h and l
  • Ids: 4, 10,11 and 12

Correct result should look like:

    Series Id Value1 Value2
 1:      b  5      5      2
 2:      b  6      6      3
 3:      c  7      7      4
 4:      c  8      8     NA
 5:      c  9      9      6
 6:      f  5      5      2
 7:      f  6      6      3
 8:      g  7      7      4
 9:      g  8      8      5
10:      g  9      9      6
11:      i  1      1      1
12:      i  2      2      2
13:      i  3      3      3
14:      j  5      5      5
15:      j  6      6      6
16:      k  7      7      7
17:      k  8      8      8
18:      k  9      9      9
    Series Id Value1 Value2


What I managed so far:

I can find the Series that are NA for Value1 like this:

DT[, sum(1-is.na(Value1)) == 0, by = Series][V1 == TRUE]

And I could even do

setkey(DT, Series)
DT = DT[DT[, sum(1-is.na(Value)) == 0, by = Series][V1 != TRUE]]

But now I end up with V1 appearing in the final table.

解决方案

You can do this to get those entries where not ALL Value are NA:

setkey(DT, "Series")
DT[, .SD[(!all(is.na(Value)))], by=Series]

The parens around !all are needed to avoid not-join syntax which Matthew will look into (see comments). Same as this :

DT[, .SD[as.logical(!all(is.na(Value)))], by=Series]


Building on that to answer the new clarified question :

allNA = function(x) all(is.na(x))     # define helper function
for (i in c("Id","Series"))
    DT = DT[, if (!any(sapply(.SD,allNA))) .SD else NULL, by=i]
DT
    Series Id Value1 Value2
 1:      i  1      1      1
 2:      i  2      2      2
 3:      i  3      3      3
 4:      b  5      5      2
 5:      b  6      6      3
 6:      f  5      5      2
 7:      f  6      6      3
 8:      j  5      5      5
 9:      j  6      6      6
10:      c  7      7      4
11:      c  8      8     NA
12:      c  9      9      6
13:      g  7      7      4
14:      g  8      8      5
15:      g  9      9      6
16:      k  7      7      7
17:      k  8      8      8
18:      k  9      9      9

That changes the order, though. So isn't precisely the result requested. The following keeps the order and should be faster too.

# starting fresh from original DT in question again
DT[,drop:=FALSE]
for (i in c("Series","Id"))
    DT[,drop:=drop|any(sapply(.SD,allNA)),by=i]
DT[(!drop)][,drop:=NULL][]
    Series Id Value1 Value2
 1:      b  5      5      2
 2:      b  6      6      3
 3:      c  7      7      4
 4:      c  8      8     NA
 5:      c  9      9      6
 6:      f  5      5      2
 7:      f  6      6      3
 8:      g  7      7      4
 9:      g  8      8      5
10:      g  9      9      6
11:      i  1      1      1
12:      i  2      2      2
13:      i  3      3      3
14:      j  5      5      5
15:      j  6      6      6
16:      k  7      7      7
17:      k  8      8      8
18:      k  9      9      9

这篇关于如何从R data.table中删除NA块的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆