Add an exact proportion of random missing values to a data.frame


Question

I would like to add random NAs to a data.frame in R. So far I have looked into these questions:

  • R: randomly insert NAs into a data frame proportionally
  • How do I add random NAs into a data frame
  • Add random missing values to a complete data frame (in R)

Many solutions were provided there, but I couldn't find one that complies with all 5 of these conditions:

  • Add truly random NAs, and not the same number per row or per column.
  • Work with every class of variable that one can encounter in a data.frame (numeric, character, factor, logical, ts...), so the output must have the same format as the input data.frame or matrix.
  • Guarantee an exact number or proportion [note] of NAs in the output (many solutions produce fewer NAs because several are generated at the same position).
  • Be computationally efficient for big datasets.
  • Add the proportion/number of NAs independently of any NAs already present in the input.

Does anyone have an idea? I have already tried to write a function to do this (in an answer to the first linked question), but it doesn't comply with points 3 and 4. Thanks.

[note] An exact proportion, rounded to within +/- 1 NA of course.

Recommended answer

This is the way that I do it for my paper on library(imputeMulti), which is currently in review at JSS. It inserts NAs into a random percentage of the whole dataset and scales well. It doesn't guarantee an exact proportion in the case where (n * p * pctNA) %% 1 != 0, since the number of NAs is rounded down with floor().

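createNAs <- function (x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  # logical mask over all n * p cells, initially all FALSE
  NAloc <- rep(FALSE, n * p)
  # mark floor(n * p * pctNA) distinct cells, chosen uniformly at random
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  # reshape the mask to the dimensions of x and set the marked cells to NA
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  return(x)
}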

Obviously you should set a random seed for reproducibility, which can be done before the function call.
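As a minimal sketch of that advice, the snippet below calls set.seed() before each call and checks that the NA positions repeat. The function is copied from the answer so the example runs on its own; the toy data frame `df` and the seed value 42 are made up for illustration.

```r
# Same function as in the answer, repeated so this example is self-contained.
createNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  NAloc <- rep(FALSE, n * p)
  NAloc[sample.int(n * p, floor(n * p * pctNA))] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

# Hypothetical toy data with mixed column classes (7 rows x 4 columns).
df <- data.frame(
  num = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5),
  chr = letters[1:7],
  fac = factor(c("a", "b", "a", "b", "a", "b", "a")),
  lgl = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

set.seed(42)                      # fix the RNG state before the call
out1 <- createNAs(df, pctNA = 0.3)
set.seed(42)                      # same seed again
out2 <- createNAs(df, pctNA = 0.3)

identical(out1, out2)             # TRUE: same seed, same NA positions
sum(is.na(out1))                  # 8, i.e. floor(7 * 4 * 0.3) = floor(8.4)
sapply(out1, class)               # column classes are the same as in df
```

Because sample.int() draws without replacement by default, no position is picked twice, so the number of NAs in a complete input is exactly floor(n * p * pctNA).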

This works as a general strategy for creating baseline datasets for comparison across imputation methods. I believe this is what you want, although your question (as noted in the comments) isn't clearly stated.

Edit: I do assume that x is complete, so I'm not sure how it would handle existing missing data. You could certainly modify the code if you want, though that would probably increase the runtime by at least O(n*p).
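One possible modification, sketched here under assumptions rather than taken from the answer, is to sample only among cells that are not already missing, so that the number of newly added NAs does not depend on the NAs already in the input. The name `addNAs` and the toy data are hypothetical; the is.na() scan over all cells is the extra O(n*p) pass.

```r
# Hypothetical variant (not the answer's function): add exactly
# floor(n * p * pctNA) NEW NAs, sampling only among non-missing cells.
addNAs <- function(x, pctNA = 0.1) {
  n <- nrow(x)
  p <- ncol(x)
  avail <- which(!is.na(x))       # column-major indices of non-missing cells
  k <- floor(n * p * pctNA)       # number of new NAs to add
  if (k > length(avail)) stop("not enough non-missing cells")
  pick <- avail[sample.int(length(avail), k)]
  NAloc <- rep(FALSE, n * p)
  NAloc[pick] <- TRUE
  x[matrix(NAloc, nrow = n, ncol = p)] <- NA
  x
}

# Toy input (6 rows x 2 columns) that already contains 2 NAs.
df <- data.frame(
  a = c(1, NA, 3, 4, 5, 6),
  b = c("u", "v", NA, "x", "y", "z"),
  stringsAsFactors = FALSE
)

set.seed(1)
out <- addNAs(df, pctNA = 0.25)

sum(is.na(out)) - sum(is.na(df))  # 3 new NAs: floor(6 * 2 * 0.25)
all(is.na(out)[is.na(df)])        # TRUE: the original NAs are still NA
```

Indexing through `avail` keeps the draw uniform over the remaining non-missing cells, and the mask is built in the same column-major order that matrix() uses to reshape it.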
