Handling missing/incomplete data in R: is there a function to mask but not remove NAs?

Question

As you would expect from a DSL aimed at data analysis, R handles missing/incomplete data very well, for instance:

Many R functions have an na.rm flag that, when set to TRUE, removes the NAs before computing:

> # the NAs are dropped, leaving (5, 6, 12, 87, 9, 43, 67)
> v = mean(c(5, NA, 6, 12, NA, 87, 9, NA, 43, 67), na.rm=TRUE)
> v
[1] 32.71429

But if you want to deal with NAs before the function call, you need to do something like this:

To remove each NA from a vector:

vx = vx[!is.na(vx)]

To replace each NA in a vector with 0:

vx = ifelse(is.na(vx), 0, vx)

To remove every row that contains an NA from a data frame:

dfx = dfx[complete.cases(dfx),]

All of these functions permanently remove 'NA' or rows with an 'NA' in them.

Sometimes this isn't quite what you want, though. Making an NA-excised copy of the data frame might be necessary for the next step in the workflow, but in subsequent steps you often want those rows back (e.g., to calculate a column-wise statistic for a column that contains no NA values itself, yet lost rows to a prior call to complete.cases).
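
To make the situation concrete, here is a minimal sketch with made-up data (the data frame and variable names are assumptions for illustration only). Column b has no NAs, yet its statistic changes once rows are excised because of an NA in column a:

```r
# Hypothetical data: column a has one NA, column b is complete
dfx <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, 30, 40))

# Excised copy, needed for a step that cannot tolerate NAs
dfx_complete <- dfx[complete.cases(dfx), ]

mean_b_full    <- mean(dfx$b)           # uses all four rows
mean_b_excised <- mean(dfx_complete$b)  # row 2 is gone although b[2] is not NA
```

The two means differ, which is why overwriting dfx with the excised copy loses information that a later step may need.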

To be as clear as possible about what I'm looking for: Python/NumPy has a masked array class (numpy.ma) with a mask that lets you conceal, but not remove, NAs during a function call. Is there an analogous function in R?

Recommended answer

Exactly what to do with missing data -- which may be flagged as NA if we know it is missing -- may well differ from domain to domain.

Take time series as an example, where you may want to skip, fill, or interpolate (possibly in different ways). The (very useful and popular) zoo package alone has all these functions related to NA handling:

zoo::na.approx  zoo::na.locf    
zoo::na.spline  zoo::na.trim    

These let you approximate (using different algorithms), carry values forward or backward, interpolate with splines, or trim.
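
As a rough illustration of those four functions, here is a small sketch on a plain numeric vector. It assumes the zoo package is installed, and the input vector is made up for the example:

```r
library(zoo)  # assumes the zoo package is installed

# Hypothetical series with leading, interior and trailing NAs
x <- c(NA, 1, NA, 3, NA, NA, 6, NA)

trimmed  <- na.trim(x)          # drops leading and trailing NAs only
approx_x <- na.approx(trimmed)  # linear interpolation of the interior NAs
locf_x   <- na.locf(trimmed)    # last observation carried forward
spline_x <- na.spline(trimmed)  # cubic spline interpolation
```

Each call returns a new object and leaves x untouched, so the choice of NA treatment can be made per step rather than once for the whole data set.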

Another example would be the numerous missing-data imputation packages on CRAN, which often provide domain-specific solutions. [ So if you call R a DSL, what is this? "Sub-domain specific solutions for domain specific languages" or SDSSFDSL? Quite a mouthful :) ]

But for your specific question: no, I am not aware of a bit-level flag in base R that allows you to mark observations as 'to be excluded'. I presume most R users would resort to functions like na.omit() et al or use the na.rm=TRUE option you mentioned.
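
Without such a flag, one possible workaround is to keep a logical vector next to the data and index with it only where needed. This is a sketch of that idea, not a built-in masking facility; with_mask is a hypothetical helper and the data are made up:

```r
dfx <- data.frame(a = c(1, NA, 3, 4), b = c(10, 20, 30, 40))

# Logical mask: TRUE where the row has no NA. dfx itself is never modified.
mask <- complete.cases(dfx)

# Hypothetical helper: apply f to the unmasked rows only
with_mask <- function(df, mask, f) f(df[mask, , drop = FALSE])

n_used <- with_mask(dfx, mask, nrow)  # rows visible to the NA-sensitive step
mean_b <- mean(dfx$b)                 # a later step still sees every row

# na.omit() also records which rows it dropped in the "na.action" attribute
dropped <- as.integer(attr(na.omit(dfx), "na.action"))
```

Because the original data frame is kept intact, the rows hidden in one step remain available to the next.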
