fread: empty string ("") in na.strings is not interpreted as NA


Problem description

How can I get fread() to set "" to a NA for all variables including character variables?

I am importing a .csv file where missing values are empty strings (""; no space). I want "" to be interpreted as the missing value NA and tried na.strings = "", without success:

data <- fread("file.csv", na.strings = "")

unique(data$character_variable)
# [1] "abc" "def"      ""            

On the other hand, when I use read.csv with na.strings = "", the "" are turned into NAs, even for character variables. This is the result I want.

data <- read.csv("file.csv", na.strings = "")

unique(data$character_variable)
# [1] "abc" "def"      NA

Versions

  • R version 3.6.1 (2019-07-05)
  • data.table_1.12.8

Recommended answer

Well, you can't if your csv file looks like this:

a,b
x,y
"",1

Note that whatever is inside the double quotes is treated as a string literal, because " is the quote character. In that sense, ,"", in a csv file just means an empty string, not a missing value (i.e. ,,). I would consider this a good feature for consistency. This is also documented under the na.strings argument in the documentation of fread:

A character vector of strings which are to be interpreted as NA values. By default, ",," for columns of all types, including type character is read as NA for consistency. ,"", is unambiguous and read as an empty string. To read ,NA, as NA, set na.strings="NA". To read ,, as blank string "", set na.strings=NULL. When they occur in the file, the strings in na.strings should not appear quoted since that is how the string literal ,"NA", is distinguished from ,NA,, for example, when na.strings="NA".
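
As a rough illustration of the distinction drawn in that passage, here is a minimal sketch that reads inline text instead of a file. The sample rows are made up for this example, and the behaviour may differ between data.table versions. It only checks two things: an unquoted empty field in an integer column, and a quoted "" in a character column, both with the default na.strings.

```r
library(data.table)

# Made-up sample: row 1 ends with an unquoted empty field (x,),
# row 2 starts with a quoted empty string ("",1)
lines <- c('a,b', 'x,', '"",1')
dt <- fread(text = lines, sep = ",", header = TRUE)

# Unquoted empty field in the integer column b: a missing value
is.na(dt$b[1])

# Quoted "" in the character column a: kept as an empty string
dt$a[2] == ""
```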

On the other hand, you may notice that if the file looks like this

a,b
1,y
"",1

, then the empty string will be converted into NA. However, I think it's not a bug because this behaviour is probably a consequence of type coercion by the parser. In the Details section of the same document, you can see that

The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character.

So column a is first read as a character column and later converted into an integer one. The empty string is still read as is but coerced into an NA_integer_ in the second step.
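
If the quoted empty strings in character columns still need to become NA, one possible workaround (not part of the answer above) is to convert them after reading. The sketch below uses inline text as a stand-in for file.csv, so the sample rows are an assumption; it loops over the character columns and replaces "" by reference with data.table's set().

```r
library(data.table)

# Inline stand-in for file.csv (hypothetical sample data)
lines <- c('a,b', 'x,y', '"",1')
dt <- fread(text = lines, sep = ",", header = TRUE, na.strings = "")

# Names of all character columns
char_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Replace empty strings with NA_character_, column by column, by reference
for (col in char_cols) {
  blank <- which(dt[[col]] == "")   # which() skips values that are already NA
  if (length(blank) > 0) {
    set(dt, i = blank, j = col, value = NA_character_)
  }
}

unique(dt$a)
```

Because the replacement happens after parsing, it does not depend on how fread treats quoted fields, and it is harmless if the values are already NA.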
