Preventing column-class inference in fread()

Question

Is there a way for fread to mimic the behaviour of read.table, whereby the class of each variable is set by the data that is actually read in?

I have numeric data with a few comments underneath the main data. When I use fread to read in the data, the columns are converted to character. However, by setting nrow in read.table I can stop this behaviour. Is this possible in fread? (I would prefer not to alter the raw data or make an amended copy.) Thanks

An example

d <- data.frame(x=c(1:100, NA, NA, "fff"), y=c(1:100, NA,NA,NA)) 
write.csv(d, "test.csv",  row.names=F)

in_d <- read.csv("test.csv", nrow=100, header=T)
in_dt <- data.table::fread("test.csv", nrow=100)

which produces

> str(in_d)
'data.frame':   100 obs. of  2 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10 ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
> str(in_dt)
Classes ‘data.table’ and 'data.frame':  100 obs. of  2 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>

As a workaround I thought I would be able to use read.table to read in one line, get the classes, and pass them to colClasses, but I am misunderstanding something.

cl <- read.csv("test.csv", nrow=1,  header=T)
cols <- unname(sapply(cl, class))
in_dt <- data.table::fread("test.csv", nrow=100, colClasses=cols)
str(in_dt)

Using Windows 8.1, R version 3.1.2 (2014-10-31), Platform: x86_64-w64-mingw32/x64 (64-bit)

Recommended answer

Option 1: Using a system command

fread() allows the use of a system command in its first argument. We can use it to remove the quotes in the first column of the file.

indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

The system command cat test.csv | tr -d '"' explained:

  • cat test.csv reads the file to standard output
  • | is a pipe, using the output of the previous command as input for the next command
  • tr -d '"' deletes (-d) all occurrences of double quotes ('"') from the current input
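
On systems where cat and tr are not available (a plain Windows setup, for instance), the same quote-stripping idea can be sketched inside R. This is only an illustration under stated assumptions: it relies on the text argument of fread(), which is assumed to exist in your data.table version (it is not part of the v1.9.5 this answer was written against), and the temporary file path and the name indt3 are made up for the example.

```r
library(data.table)

# Recreate the file from the question in a temporary location (hypothetical path).
path <- tempfile(fileext = ".csv")
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, path, row.names = FALSE)

# Read the raw lines and delete every double quote, mirroring tr -d '"'.
raw_lines <- readLines(path)
clean_lines <- gsub('"', "", raw_lines, fixed = TRUE)

# Keep the header plus the first 100 data rows and parse them from memory.
# Assumption: fread() accepts a character vector of lines via text=.
indt3 <- fread(text = clean_lines[1:101])
str(indt3)
```

Like the tr approach, this removes all double quotes, so it is only appropriate when no field relies on quoting (for example, to protect an embedded comma).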

Option 2: Coerce after reading

Since option 1 doesn't seem to be working on your system, another possibility is to read the file as you did, but convert the x column with type.convert().

library(data.table)
# as.is = TRUE is passed explicitly; newer R versions warn when it is omitted
indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x, as.is = TRUE)]
str(indt2)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

Side note: I usually prefer to use type.convert() over as.numeric() to avoid the "NAs introduced by coercion" warning triggered in some cases. For example,

x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1]  1  4 NA  6
# Warning message:
# NAs introduced by coercion 
type.convert(x, as.is = TRUE)
# [1]  1  4 NA  6

Of course, you could also use as.numeric().
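
If several columns come back as character, the same coercion can be applied to all of them at once. The following is a minimal base R sketch, not part of the original answer; the data frame df and its columns are invented for illustration, and the same pattern should carry over to a data.table.

```r
# A small stand-in for a table whose numeric column was read as character.
df <- data.frame(x = as.character(1:5),
                 y = 1:5,
                 z = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

# Re-infer the type of every character column. as.is = TRUE keeps genuine
# strings as character instead of converting them to factors.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], type.convert, as.is = TRUE)
str(df)
```

Columns that contain real text, such as z here, stay character, so the conversion is safe to run across the whole table.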

Note: this answer assumes data.table dev v1.9.5
