Fast read of different types of data with the same command, better separator guessing
Question
I have LD data. Sometimes it is the raw output file from PLINK, as below (note the spaces used to pad the output so it looks pretty, including leading and trailing spaces):
write.table(read.table(text="
 CHR_A         BP_A        SNP_A  CHR_B         BP_B        SNP_B           R2 
     1    154834183    rs1218582      1    154794318    rs9970364    0.0929391 
     1    154834183    rs1218582      1    154795033   rs56744813      0.10075 
     1    154834183    rs1218582      1    154797272   rs16836414     0.106455 
     1    154834183    rs1218582      1    154798550  rs200576863    0.0916789 
     1    154834183    rs1218582      1    154802379   rs11264270     0.176911 ",sep="x"),
"Type1.txt",col.names=FALSE,row.names=FALSE,quote=FALSE)
Or it is a nicely tab-separated file:
write.table(read.table(text="
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
1 154834183 rs1218582 1 154794318 rs9970364 0.0929391
1 154834183 rs1218582 1 154795033 rs56744813 0.10075
1 154834183 rs1218582 1 154797272 rs16836414 0.106455
1 154834183 rs1218582 1 154798550 rs200576863 0.0916789
1 154834183 rs1218582 1 154802379 rs11264270 0.176911", sep=" "),
"Type2.txt",col.names=FALSE,row.names=FALSE,quote=FALSE,sep="\t")
read.csv works for both types of data:
read.csv("Type1.txt", sep="")
read.csv("Type2.txt", sep="")
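The reason sep="" copes with both layouts is described in ?read.table: an empty separator means "white space", that is, one or more spaces, tabs, newlines or carriage returns. A minimal base R sketch of that behaviour, using a made-up two-line excerpt passed inline rather than the files above:

```r
# Illustrative excerpt only (three columns, one data row):
# one version padded with spaces, one tab separated.
padded <- " CHR_A       BP_A      SNP_A \n     1  154834183  rs1218582 \n"
tabbed <- "CHR_A\tBP_A\tSNP_A\n1\t154834183\trs1218582\n"

# sep = "" splits on any run of white space, so both layouts parse alike.
a <- read.csv(text = padded, sep = "", stringsAsFactors = FALSE)
b <- read.csv(text = tabbed, sep = "", stringsAsFactors = FALSE)

# Compare the two results.
print(identical(a, b))
```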
fread works only for Type2:
fread("Type1.txt")
fread("Type2.txt")
The files are big, with millions of rows, so read.csv is not an option. Is there a way to make fread guess better? Any other package/function suggestions?
I could use readLines and then guess the type of file, or tidy up the file with a system call before calling fread, but this would add overhead I am trying to avoid.
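For reference, the readLines route would not need to read the whole file, since a peek at the first few lines is enough to tell the two layouts apart. A rough sketch follows; guess_layout is a hypothetical helper name, and its rule (any tab in the sample means tab-separated) is an assumption that only fits the two file types described here:

```r
# Hypothetical helper: inspect only the first n lines of the file.
guess_layout <- function(path, n = 5L) {
  head_lines <- readLines(path, n = n, warn = FALSE)
  if (any(grepl("\t", head_lines, fixed = TRUE))) "tab" else "space"
}

# Small demonstration with two temporary files (made-up excerpts).
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c(" CHR_A       BP_A ", "     1  154834183 "), f1)
writeLines(c("CHR_A\tBP_A", "1\t154834183"), f2)

print(guess_layout(f1))
print(guess_layout(f2))
```

The result could then choose the arguments passed to the reader. It is still one extra file open per file, which is the kind of additional step the question hopes to avoid.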
SessionInfo
R version 3.2.0 (2015-04-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Recommended answer
This is fixed in the devel version, v1.9.5. Either use devel (/upgrade) or wait a while for it to reach CRAN as v1.9.6:
require(data.table) # v1.9.5+
ans <- fread("Type1.txt")
# CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
# 1: 1 154834183 rs1218582 1 154794318 rs9970364 0.0929391
# 2: 1 154834183 rs1218582 1 154795033 rs56744813 0.1007500
# 3: 1 154834183 rs1218582 1 154797272 rs16836414 0.1064550
# 4: 1 154834183 rs1218582 1 154798550 rs200576863 0.0916789
# 5: 1 154834183 rs1218582 1 154802379 rs11264270 0.1769110
fread() has gained a strip.white argument (default = TRUE), among other new arguments and bug fixes. Please see the README file on the project page for more information.
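As an illustration of the argument, the sketch below writes a small space-padded file to a temporary location and reads it with strip.white spelled out. It assumes data.table v1.9.6 or later is installed (the code is skipped otherwise), and the rows are a made-up three-column excerpt of the data above:

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  # Write a padded excerpt with leading and trailing spaces.
  tmp <- tempfile(fileext = ".txt")
  writeLines(c(
    " CHR_A         BP_A        SNP_A ",
    "     1    154834183    rs1218582 ",
    "     1    154834183   rs56744813 "
  ), tmp)

  # strip.white = TRUE is the default; it is written out here for clarity.
  ans2 <- data.table::fread(tmp, strip.white = TRUE)
  print(ans2)
}
```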
The column types are also detected correctly:
sapply(ans, class)
# CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
# "integer" "integer" "character" "integer" "integer" "character" "numeric"