Preventing column-class inference in fread()


Problem description


Is there a way for fread to mimic the behaviour of read.table, whereby the class of a variable is set by the data that is actually read in?

I have numeric data with a few comments underneath the main data. When I use fread to read in the data, the columns are converted to character. However, by setting nrow in read.table I can stop this behaviour. Is this possible in fread? (I would prefer not to alter the raw data or make an amended copy.) Thanks

An example

d <- data.frame(x=c(1:100, NA, NA, "fff"), y=c(1:100, NA,NA,NA)) 
write.csv(d, "test.csv",  row.names=F)

in_d <- read.csv("test.csv", nrow=100, header=T)
in_dt <- data.table::fread("test.csv", nrow=100)

Which produces

> str(in_d)
'data.frame':   100 obs. of  2 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10 ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
> str(in_dt)
Classes ‘data.table’ and 'data.frame':  100 obs. of  2 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>

As a workaround I thought I would be able to use read.table to read in one line, get the classes and pass them as colClasses, but I am misunderstanding something.

cl <- read.csv("test.csv", nrow=1,  header=T)
cols <- unname(sapply(cl, class))
in_dt <- data.table::fread("test.csv", nrow=100, colClasses=cols)
str(in_dt)

Using Windows 8.1, R version 3.1.2 (2014-10-31), Platform: x86_64-w64-mingw32/x64 (64-bit)

Solution

Option 1: Using a system command

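fread() allows the use of a system command in its first argument. Because x contains the string "fff", write.csv() wrote every non-missing value of that column in double quotes, and fread() reads those quoted values as character here. We can use a system command to remove the quotes in the first column of the file before it is parsed.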

indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

The system command cat test.csv | tr -d '\"' explained:

  • cat test.csv reads the file to standard output
  • | is a pipe, using the output of the previous command as input for the next command
  • tr -d '\"' deletes (-d) all occurrences of double quotes ('\"') from the current input
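On a system without cat and tr (a stock Windows setup, for example), the same quote stripping can be sketched in plain R. This is only an illustration under a few assumptions: it relies on the text argument of fread(), which as far as I know exists in recent data.table releases but not necessarily in the v1.9.5 assumed by this answer, and csv_path, raw_lines and clean_lines are made-up names. It writes the example file to a temporary location instead of touching test.csv.

```r
library(data.table)

# Recreate the example file from the question in a temporary location
# (csv_path is a hypothetical name used only for this sketch).
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
csv_path <- tempfile(fileext = ".csv")
write.csv(d, csv_path, row.names = FALSE)

# Look at the raw file: the x values are quoted, the y values are not.
raw_lines <- readLines(csv_path)
head(raw_lines, 3)
# The first three lines of the file are:
#   "x","y"
#   "1",1
#   "2",2

# Drop every double quote (the same effect as tr -d '"') and pass the
# cleaned lines to fread() as text instead of a file name.
clean_lines <- gsub("\"", "", raw_lines, fixed = TRUE)
indt <- fread(text = clean_lines, nrows = 100)
str(indt)
```

The raw file on disk is left untouched; only the in-memory copy of the lines is changed, which matches the constraint in the question.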

Option 2: Coercion after reading

Since option 1 doesn't seem to be working on your system, another possibility is to read the file as you did, but convert the x column with type.convert().

library(data.table)
indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x)]
str(indt2)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr> 
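If more than one column comes in as character, the same idea can be applied to all of them at once. A minimal sketch, assuming a small made-up table (dt and chr_cols are illustrative names, not part of the question). as.is = TRUE is passed explicitly because, as far as I know, recent R versions warn when it is left unspecified, and it keeps genuinely non-numeric columns as character rather than factor.

```r
library(data.table)

# Made-up table: x holds numbers stored as strings, z holds real text.
dt <- data.table(x = c("1", "2", "3"), y = 1:3, z = c("a", "b", "c"))

# Find the character columns and let type.convert() pick a type for each.
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt[, (chr_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = chr_cols]

str(dt)
```

Here x ends up as integer while z stays character, so columns that really contain text are not damaged by the conversion.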

Side note: I usually prefer to use type.convert() over as.numeric() to avoid the "NAs introduced by coercion" warning triggered in some cases. For example,

x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1]  1  4 NA  6
# Warning message:
# NAs introduced by coercion 
type.convert(x)
# [1]  1  4 NA  6

But of course you can use as.numeric() as well.


Note: This answer assumes data.table dev v1.9.5
