Preventing column-class inference in fread()
Problem description
Is there a way for fread to mimic the behaviour of read.table, whereby the class of each variable is set by the data that is actually read in?
I have numeric data with a few comments underneath the main data. When I use fread to read in the data, the columns are converted to character. However, by setting nrows in read.table I can stop this behaviour. Is this possible in fread? (I would prefer not to alter the raw data or make an amended copy.) Thanks.
An example:
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, "test.csv", row.names = FALSE)
in_d <- read.csv("test.csv", nrows = 100, header = TRUE)
in_dt <- data.table::fread("test.csv", nrows = 100)
which produces
> str(in_d)
'data.frame': 100 obs. of 2 variables:
$ x: int 1 2 3 4 5 6 7 8 9 10 ...
$ y: int 1 2 3 4 5 6 7 8 9 10 ...
> str(in_dt)
Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
$ x: chr "1" "2" "3" "4" ...
$ y: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
As a workaround, I thought I would be able to use read.table to read in one line, get the classes and pass them to colClasses, but I am misunderstanding something.
cl <- read.csv("test.csv", nrows = 1, header = TRUE)
cols <- unname(sapply(cl, class))
in_dt <- data.table::fread("test.csv", nrows = 100, colClasses = cols)
str(in_dt)
Using Windows 8.1, R version 3.1.2 (2014-10-31), Platform: x86_64-w64-mingw32/x64 (64-bit)
Recommended answer
Option 1: Use a system command
fread() allows the use of a system command in its first argument. We can use it to remove the quotes in the first column of the file.
indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
# $ x: int 1 2 3 4 5 6 7 8 9 10 ...
# $ y: int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
Explanation of the system command cat test.csv | tr -d '"':
cat test.csv reads the file to standard output
| is a pipe, using the output of the previous command as input for the next command
tr -d '"' deletes (-d) all occurrences of double quotes ('"') from the current input
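On systems where cat and tr are not on the PATH (a stock Windows installation, for instance), the same quote-stripping idea can be approximated in base R without touching the file on disk. The following is only a sketch under that assumption: csv_path, raw_lines and clean_lines are illustrative names, and read.csv() stands in for fread() so that the block depends on nothing beyond base R. Recent data.table versions accept text through fread(text = ...), which may let the cleaned lines be passed to fread() instead.

```r
# Recreate the example file from the question in a temporary location.
csv_path <- tempfile(fileext = ".csv")
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, csv_path, row.names = FALSE)

# Read the header plus the first 100 data rows as plain text.
raw_lines <- readLines(csv_path, n = 101)

# Rough equivalent of `tr -d '"'`: drop every double quote.
clean_lines <- gsub('"', "", raw_lines, fixed = TRUE)

# Parse the cleaned text; the file on disk is left unchanged.
in_d <- read.csv(text = clean_lines, header = TRUE)
str(in_d)
```

Note that removing every double quote is only safe when no field contains an embedded comma; otherwise the quotes are needed to keep such fields together.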
Option 2: Coerce after reading
Since option 1 doesn't seem to be working on your system, another possibility is to read the file as you did, but convert the x column with type.convert().
library(data.table)
indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x, as.is = TRUE)]
str(indt2)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
# $ x: int 1 2 3 4 5 6 7 8 9 10 ...
# $ y: int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
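When it is not known in advance which columns come back as character, the same coercion can be applied to every column. The sketch below illustrates this with a base R data.frame so that it runs without data.table; csv_path, raw and converted are illustrative names, and reading with colClasses = "character" is merely an assumption used to simulate columns that arrived as text. The same idea should carry over to a data.table.

```r
# Recreate the example file from the question in a temporary location.
csv_path <- tempfile(fileext = ".csv")
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, csv_path, row.names = FALSE)

# Read the first 100 rows with every column forced to character.
raw <- read.csv(csv_path, nrows = 100, colClasses = "character")

# Let type.convert() choose a suitable class for each column.
converted <- raw
converted[] <- lapply(raw, type.convert, as.is = TRUE)
sapply(converted, class)
```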
Side note: I usually prefer to use type.convert() over as.numeric() to avoid the "NAs introduced by coercion" warning triggered in some cases. For example,
x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1] 1 4 NA 6
# Warning message:
# NAs introduced by coercion
type.convert(x, as.is = TRUE)
# [1] 1 4 NA 6
Of course, you could also use as.numeric().
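To make the difference between the two functions more concrete, here is a small sketch of how they treat a value that is not a number at all. The vector mixed and the names converted, kept and forced are illustrative and not part of the original answer; as.is = TRUE is passed explicitly because newer R versions warn when it is left unspecified.

```r
x <- c("1", "4", "NA", "6")

# Every entry is numeric or NA, so the result is an integer vector.
converted <- type.convert(x, as.is = TRUE)
class(converted)

# A single non-numeric entry keeps the whole vector as character ...
mixed <- c("1", "4", "fff")
kept <- type.convert(mixed, as.is = TRUE)
class(kept)

# ... whereas as.numeric() turns that entry into NA (the warning is silenced here).
forced <- suppressWarnings(as.numeric(mixed))
forced
```

In other words, type.convert() declines to convert when the data are not cleanly numeric, while as.numeric() always converts and replaces anything it cannot parse with NA.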
Note: this answer assumes data.table dev v1.9.5.