Sample a single row, per column, with substantial missing data


Problem description


As an example of my data frame, which I will call df1, I have GROUP1 with three rows of data, and GROUP2 with two rows of data. I have three variables, X1, X2, and X3:

GROUP          X1    X2   X3
GROUP1         A     NA   NA
GROUP1         NA    NA   T
GROUP1         C     T    G   
GROUP2         NA    NA   C
GROUP2         G     NA   T
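
To reproduce the example, the table above could be built roughly as follows. This is only a sketch: the column types of the real data are not shown in the question, so plain character columns are an assumption here.

```r
# Reconstruction of the example table.
# Assumption: X1-X3 are character columns (stringsAsFactors = FALSE).
df1 <- data.frame(
  GROUP = c("GROUP1", "GROUP1", "GROUP1", "GROUP2", "GROUP2"),
  X1    = c("A", NA, "C", NA, "G"),
  X2    = c(NA, NA, "T", NA, NA),
  X3    = c(NA, "T", "G", "C", "T"),
  stringsAsFactors = FALSE
)
```

Setting `stringsAsFactors = TRUE` instead would give factor columns, which is the case discussed in the answer.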

I am halfway to my answer, based on a previous question and answer (Sample a single row, per column, within a subset of a data frame in R, while following conditions), except that I am having issues with character data.

I would like to sample a single value per column from GROUP1, to make a new row representing GROUP1. I do not want to sample one single, complete row from GROUP1; rather, the sampling needs to occur individually for each column. I would like to do the same for GROUP2. Also, the sampling should not consider or include NAs, unless all rows for that group's variable are NA (such as GROUP2, variable X2, above).

For example, after sampling, I could have as a result:

GROUP         X1    X2   X3
GROUP1        A     T    T
GROUP2        G     NA   C

Only GROUP2, variable X2, can result in NA here. I actually have 300 taxa, 40 groups, 160000 variables, and a substantial number of NA's.

When I use:

library(data.table)

setDT(df1)[,lapply(.SD, function(x)
if(all(is.na(x))) NA_character_ else sample(na.omit(x),1)) , by = GROUP]

I end up with a warning:

Column 2 of result for group 2 is type 'character' but expecting type    
'integer'. Column types must be consistent for each group.

However, this warning does not seem to be limited to those group variables that consist entirely of NAs.

If I instead replace NA_character_ with NA_integer_, some columns end up containing the sum of the non-NA rows for the group's variable, rather than a sample from across the rows.

Solution

You can use this data.table call:

setDT(df1)[ , lapply(.SD, 
  function(x) x[!is.na(x)][sample(sum(!is.na(x)), 1)]), by = GROUP]

Or you can tweak your original one:

setDT(df1)[,lapply(.SD, function(x)
  if(all(is.na(x))) NA_character_ 
    else as.character(na.omit(x))[sample(length(na.omit(x)), 1)]) , by = GROUP]

Or using aggregate from base R:

aggregate(df1[ , names(df1) != "GROUP"], by=list(df1$GROUP), 
  function(ii) ifelse(length(na.omit(ii)) == 0, 
    NA,
    as.character(na.omit(ii))[sample(length(na.omit(ii)), 1)])) 
    # Note use of as.character in case of factors
#  Group.1 X1   X2 X3
#1  GROUP1  A    T  T
#2  GROUP2  G <NA>  C

As thelatemail mentioned, the issue you are encountering is most likely due to variables being factors, as your code works when X1-X3 are characters. Any of the above solutions should work with factors.
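
The factor explanation can be checked with a small base R sketch. The helper name `sample_non_na` is hypothetical and is not part of the original answer, and the data are rebuilt with `stringsAsFactors = TRUE` purely as an assumption to mimic the problem case.

```r
# Factors are stored as integer codes, which would be consistent with
# data.table expecting 'integer' while NA_character_ is 'character'.
f <- factor(c("A", NA, "C"))
typeof(f)

# Hypothetical helper: sample one non-NA value, or return NA if none exist.
sample_non_na <- function(x) {
  x <- as.character(x)              # convert factor levels to plain strings
  keep <- x[!is.na(x)]
  if (length(keep) == 0L) return(NA_character_)
  # Sample an index rather than the values themselves, so a single
  # remaining value is returned as-is.
  keep[sample.int(length(keep), 1L)]
}

# Example data rebuilt with factor columns (assumption).
df1 <- data.frame(
  GROUP = c("GROUP1", "GROUP1", "GROUP1", "GROUP2", "GROUP2"),
  X1    = c("A", NA, "C", NA, "G"),
  X2    = c(NA, NA, "T", NA, NA),
  X3    = c(NA, "T", "G", "C", "T"),
  stringsAsFactors = TRUE
)

set.seed(1)  # only to make the illustration repeatable
res <- aggregate(df1[, names(df1) != "GROUP"],
                 by = list(GROUP = df1$GROUP),
                 FUN = sample_non_na)
res
```

In this sketch only GROUP2's X2 can come back as NA, because every other group and column combination has at least one non-missing value to draw from.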
