Preventing column-class inference in fread()
Problem description
Is there a way for fread to mimic the behaviour of read.table, whereby the class of each variable is set by the data that is actually read in?
I have numeric data with a few comments underneath the main data. When I use fread to read in the data, the columns are converted to character. However, by setting nrows in read.table I can stop this behaviour. Is this possible in fread? (I would prefer not to alter the raw data or make an amended copy.) Thanks.
An example
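# x ends with a "fff" comment, so it is not stored as a number
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, "test.csv", row.names = FALSE)
# Read only the 100 data rows with each reader
in_d <- read.csv("test.csv", nrows = 100, header = TRUE)
in_dt <- data.table::fread("test.csv", nrows = 100)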
Which produces
> str(in_d)
'data.frame': 100 obs. of 2 variables:
$ x: int 1 2 3 4 5 6 7 8 9 10 ...
$ y: int 1 2 3 4 5 6 7 8 9 10 ...
> str(in_dt)
Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
$ x: chr "1" "2" "3" "4" ...
$ y: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
As a workaround I thought I would be able to use read.table to read in one line, get the classes and set colClasses, but I am misunderstanding something.
cl <- read.csv("test.csv", nrows = 1, header = TRUE)
cols <- unname(sapply(cl, class))
in_dt <- data.table::fread("test.csv", nrows = 100, colClasses = cols)
str(in_dt)
Using Windows 8.1
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Solution
Option 1: Using a system command
fread() allows the use of a system command in its first argument. We can use it to remove the double quotes that write.csv put around the values in the first column of the file.
indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
# $ x: int 1 2 3 4 5 6 7 8 9 10 ...
# $ y: int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
The system command cat test.csv | tr -d '\"' explained:
- cat test.csv reads the file to standard output
- | is a pipe, using the output of the previous command as input for the next command
- tr -d '\"' deletes (-d) all occurrences of double quotes ('\"') from the current input; the backslash only escapes the quote inside the R string, so the shell sees tr -d '"'
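On systems where cat and tr are not available (the question was asked on Windows), the same quote-stripping idea can be sketched in R itself. This is only an illustration, not part of the original answer: it assumes a data.table version whose fread() has a text argument, and falls back to base read.csv otherwise. The file on disk is left untouched; only the lines held in memory are cleaned.

```r
# Recreate the question's file in a temporary location.
path <- tempfile(fileext = ".csv")
d <- data.frame(x = c(1:100, NA, NA, "fff"), y = c(1:100, NA, NA, NA))
write.csv(d, path, row.names = FALSE)

# Header plus the 100 data rows; the trailing comment rows are never read.
raw_lines <- readLines(path, n = 101)

# Same effect as tr -d '"': drop every double quote.
clean_lines <- gsub('"', "", raw_lines, fixed = TRUE)

# Parse the cleaned lines. fread's text argument is an assumption about the
# installed data.table version; base read.csv is used as a fallback.
parsed <- if (requireNamespace("data.table", quietly = TRUE)) {
  data.table::fread(text = clean_lines)
} else {
  read.csv(text = clean_lines)
}
str(parsed)
```

Limiting readLines() to the header and data rows plays the same role as nrows = 100 in the original call.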
Option 2: Coercion after reading
Since option 1 doesn't seem to be working on your system, another possibility is to read the file as you did, but convert the x column with type.convert().
library(data.table)
indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x, as.is = TRUE)]
str(indt2)
# Classes ‘data.table’ and 'data.frame': 100 obs. of 2 variables:
# $ x: int 1 2 3 4 5 6 7 8 9 10 ...
# $ y: int 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, ".internal.selfref")=<externalptr>
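If more than one column comes back as character, the same coercion can be applied to each of them in turn. The sketch below uses a hypothetical helper, convert_chr_cols(), written in base R and shown on a plain data.frame that stands in for the freshly read table; with a data.table, assigning by reference with := would probably be the more natural route.

```r
# Hypothetical helper: re-infer the type of every character column.
convert_chr_cols <- function(df) {
  for (nm in names(df)) {
    if (is.character(df[[nm]])) {
      # as.is = TRUE keeps genuinely non-numeric columns as character
      df[[nm]] <- type.convert(df[[nm]], as.is = TRUE)
    }
  }
  df
}

# Stand-in for a table whose numeric column x arrived as character.
in_df <- data.frame(x = as.character(1:100), y = 1:100,
                    note = rep(c("a", "b"), 50), stringsAsFactors = FALSE)
out <- convert_chr_cols(in_df)
str(out)
```

Here x becomes an integer column again, while note stays character because its values are not numbers.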
Side note: I usually prefer to use type.convert() over as.numeric() to avoid the "NAs introduced by coercion" warning triggered in some cases. For example,
x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1] 1 4 NA 6
# Warning message:
# NAs introduced by coercion
type.convert(x, as.is = TRUE)
# [1] 1 4 NA 6
But of course you can use as.numeric() as well.
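One difference is worth keeping in mind when choosing between the two: type.convert() only converts when every value looks numeric, whereas as.numeric() always returns numbers. A small base R sketch with made-up vectors, not taken from the original answer:

```r
clean <- c("1", "4", "NA", "6")
dirty <- c("1", "4", "fff")

# "NA" is treated as a missing value and the result is an integer vector.
conv_clean <- type.convert(clean, as.is = TRUE)

# A single non-numeric entry means the whole vector stays character.
conv_dirty <- type.convert(dirty, as.is = TRUE)

# as.numeric() always returns numbers, turning "fff" into NA (with a warning,
# suppressed here so the sketch runs quietly).
forced <- suppressWarnings(as.numeric(dirty))

str(conv_clean)
str(conv_dirty)
str(forced)
```

So if stray comments do end up in the column that was read, type.convert() leaves it as character rather than silently turning the comments into NA.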
Note: This answer assumes data.table dev v1.9.5