Reading file with missing values in R

Question

I have a tab-separated file whose name is stored in 'fn'. Its contents, and the call I use to read it, are as follows:

age      CALCIUM  CREATININE  GLUCOSE
64.3573           1.1         488
69.9043  8.1      1.1         472
65.6633  8.6      0.8         461
50.3693  8.1      1.3         418
57.0334  8.7      0.8         NEG
81.4939           1.1         NEG
56.954   9.8      1
76.9298  9.1      0.8         NEG


> tmpData = read.table(fn, header = TRUE,  sep= "\t" , na.strings = c('', 'NA', '<NA>'),  blank.lines.skip = TRUE)
> tmpData
      age CALCIUM CREATININE GLUCOSE
1 64.3573      NA        1.1     488
2 69.9043     8.1        1.1     472
3 65.6633     8.6        0.8     461
4 50.3693     8.1        1.3     418
5 57.0334     8.7        0.8     NEG
6 81.4939      NA        1.1     NEG
7 56.9540     9.8        1.0    <NA>
8 76.9298     9.1        0.8     NEG

The file is read as shown above, with missing values displayed as NA and <NA>. I guess that the GLUCOSE column is being treated as a factor. Is there an easy way to have <NA> interpreted as a real NA, and to convert any non-numeric value into NA (in this example, NEG into NA)?

Recommended answer

You can take advantage of the fact that as.numeric will coerce non-numeric values to NA. In other words, try something like this:

Here's your data:

temp <- structure(list(age = c(64.3573, 69.9043, 65.6633, 50.3693, 57.0334, 
  81.4939, 56.954, 76.9298), CALCIUM = c(1.1, 8.1, 8.6, 8.1, 8.7, 
  1.1, 9.8, 9.1), CREATININE = c(NA, 1.1, 0.8, 1.3, 0.8, NA, 1, 
  0.8), GLUCOSE = structure(c(5L, 4L, 3L, 2L, 6L, 6L, 1L, 6L), .Label = c("", 
  "418", "461", "472", "488", "NEG"), class = "factor")), .Names = c("age", 
  "CALCIUM", "CREATININE", "GLUCOSE"), class = "data.frame", row.names = c(NA, 
  -8L))

Its current structure:

str(temp)
# 'data.frame':  8 obs. of  4 variables:
# $ age       : num  64.4 69.9 65.7 50.4 57 ...
# $ CALCIUM   : num  1.1 8.1 8.6 8.1 8.7 1.1 9.8 9.1
# $ CREATININE: num  NA 1.1 0.8 1.3 0.8 NA 1 0.8
# $ GLUCOSE   : Factor w/ 6 levels "","418","461",..: 5 4 3 2 6 6 1 6

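Convert that last column to numeric. Since it's a factor, we need to convert it to character first; calling as.numeric directly on a factor would return its integer level codes rather than the values. Note the warning. We're actually happy about that, because it means the non-numeric entries were turned into NA.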

temp$GLUCOSE <- as.numeric(as.character(temp$GLUCOSE))
# Warning message:
# NAs introduced by coercion 

The result:

temp
#       age CALCIUM CREATININE GLUCOSE
# 1 64.3573     1.1         NA     488
# 2 69.9043     8.1        1.1     472
# 3 65.6633     8.6        0.8     461
# 4 50.3693     8.1        1.3     418
# 5 57.0334     8.7        0.8      NA
# 6 81.4939     1.1         NA      NA
# 7 56.9540     9.8        1.0      NA
# 8 76.9298     9.1        0.8      NA
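
To see why the as.character step matters, here is a minimal sketch on a small made-up factor. The name g and its four values are assumptions for illustration, loosely modelled on the GLUCOSE column:

```r
# Hypothetical factor resembling the GLUCOSE column
g <- factor(c("488", "472", "NEG", ""))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(g)

# Going through character first recovers the numbers;
# "NEG" and "" cannot be parsed and become NA
values <- suppressWarnings(as.numeric(as.character(g)))

print(codes)   # 3 2 4 1
print(values)  # 488 472 NA NA
```

suppressWarnings is used here only to keep the sketch quiet; in real work the "NAs introduced by coercion" warning is a useful signal.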






For fun, here's a little function I put together that provides an alternative approach:

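makemeNA <- function (mydf, NAStrings, fixed = TRUE) {
  # With fixed = FALSE, NAStrings is a regular expression: blank out every
  # match, then use the empty string as the NA marker
  if (!isTRUE(fixed)) {
    mydf[] <- lapply(mydf, function(x) gsub(NAStrings, "", x))
    NAStrings <- ""
  }
  # Re-detect each column's type, reading NAStrings as NA
  # (as.is = TRUE keeps text columns as character and avoids the
  # warning that newer versions of R give when as.is is left unset)
  mydf[] <- lapply(mydf, function(x) type.convert(
    as.character(x), na.strings = NAStrings, as.is = TRUE))
  mydf
}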

This function lets you specify a regular expression to identify what should be an NA value. I haven't really tested it much, so use the regex feature at your own risk!
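
The heavy lifting in makemeNA is done by type.convert, which re-guesses the type of a character vector and treats the strings listed in na.strings as missing. Here is a minimal sketch on a hypothetical vector x (not part of the original data). as.is = TRUE is passed in this sketch so that any leftover text would stay character rather than become a factor:

```r
# Hypothetical character vector mixing numbers, a text marker and a blank
x <- c("418", "NEG", "461", "")

# "NEG" is declared as an NA string; the blank entry is also treated as
# missing, so the remaining values allow an integer result
converted <- type.convert(x, na.strings = "NEG", as.is = TRUE)

str(converted)
```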

Using the same "temp" object as above, try these out to see what the function does:

# Change anything that is just text to NA
makemeNA(temp, "[A-Za-z]", fixed = FALSE)
# Change any exact matches with "NEG" to NA
makemeNA(temp, "NEG")
# Change any matches with 3-digit integers to NA
makemeNA(temp, "^[0-9]{3}$", fixed = FALSE)
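
If NEG is the only non-numeric marker in the file, another option is to declare it as a missing-value string when reading, so that GLUCOSE comes in as a number from the start. This is a minimal sketch, assuming the file is tab-separated as in the question. The inline string txt is a hypothetical stand-in for the file 'fn':

```r
# Hypothetical inline copy of the file contents, tab-separated
txt <- paste(
  "age\tCALCIUM\tCREATININE\tGLUCOSE",
  "64.3573\t\t1.1\t488",
  "69.9043\t8.1\t1.1\t472",
  "65.6633\t8.6\t0.8\t461",
  "50.3693\t8.1\t1.3\t418",
  "57.0334\t8.7\t0.8\tNEG",
  "81.4939\t\t1.1\tNEG",
  "56.954\t9.8\t1\t",
  "76.9298\t9.1\t0.8\tNEG",
  sep = "\n")

# Listing "NEG" in na.strings turns it into NA while the file is parsed
tmpData <- read.table(text = txt, header = TRUE, sep = "\t",
                      na.strings = c("", "NA", "NEG"))

str(tmpData)
```

With a real file, the text = txt argument would be replaced by the file name. If other non-numeric markers can appear, the as.numeric(as.character(...)) approach above is more general, since it does not need each marker listed in advance.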
