R data.table fread command: how to read large files with irregular separators?

Problem description

I have to work with a collection of 120 files of ~2 GB each (525600 lines x 302 columns). The goal is to compute some statistics and put the results in a clean SQLite database.

Everything works fine when my script imports the data with read.table(), but it's slow. So I tried fread, from the data.table package (version 1.9.2), but it gives me this error:

Error in fread(txt, header = T, select = c("YYY", "MM", "DD",  : 
Not positioned correctly after testing format of header row. ch=' '

The first 2 lines and 7 columns of my data look like this:

 YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00

So there is a single space at the beginning of each line, then exactly one space between the date columns, then an arbitrary number of spaces between the other columns.

I've tried a command like this to convert the spaces into commas:

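DT <- fread(
            # backslashes must be doubled inside an R string literal
            paste("sed 's/\\s\\+/,/g'", txt),
            header=TRUE,
            # column names as they appear in the header row
            select=c('YYYY','MM','DD','HH')
)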

Without success: the problem remains, and it seems to be slow with the sed command.

fread doesn't seem to like an "arbitrary number of spaces" as a separator, or an empty column at the beginning. Any ideas?

Here is a (maybe) minimal reproducible example (there is a newline character after 40790):

txt<-print(" YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00")

testDT<-fread(txt,
              header=T,
              select=c("YYY","MM","DD","HH")
)

Thanks for your help!

UPDATE: The error doesn't occur with data.table 1.8.*. With that version, the table is read as one single line, which is no better.

UPDATE 2: As mentioned in the comments, I could use sed to format the table and then read it with fread. I've put a script in an answer where I create a sample dataset and then compare some system.time() results.

Recommended answer

For your sed command:

sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g'
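As a quick check of what this substitution does, here is a minimal sketch that pipes the two sample lines from the question through it. The interval is written in escaped form, `\{1,\}`, which is how POSIX basic regular expressions spell "one or more". The variable name `csv` is invented for the illustration.

```shell
# Feed the two sample lines from the question to sed.
# First expression: remove blanks at the start of the line.
# Second expression: replace every run of one or more blanks with a single comma.
csv=$(printf '%s\n' \
  ' YYYY MM DD HH mm             19490             40790' \
  ' 1991 10  1  1  0      1.046465E+00      1.568405E+00' \
  | sed 's/^[[:blank:]]*//;s/[[:blank:]]\{1,\}/,/g')

printf '%s\n' "$csv"
# YYYY,MM,DD,HH,mm,19490,40790
# 1991,10,1,1,0,1.046465E+00,1.568405E+00
```

The output is plain comma-separated text with no empty leading field, which is the shape the question was trying to reach before handing the data to fread.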

Isn't it possible to collect everything into one (temporary) file (adding a reference to the source file) and process that file with sed (or another tool), to avoid forking the tool at every iteration?
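One possible reading of this suggestion is to convert all the files in a single process and tag each row with the file it came from. The sketch below uses awk instead of sed, which is my own substitution and not something the answer proposes, because awk exposes the current file name as FILENAME. The file names `a.txt` and `b.txt`, the `source` column, and the variable `combined` are all invented for the example.

```shell
# Work in a scratch directory so the invented file names cannot clobber anything.
tmpdir=$(mktemp -d)
printf '%s\n' ' YYYY MM DD' ' 1991 10  1' > "$tmpdir/a.txt"
printf '%s\n' ' YYYY MM DD' ' 1992 11  2' > "$tmpdir/b.txt"

# A single awk process reads every file:
#  - "$1 = $1" makes awk rebuild the line with OFS, which drops the leading
#    blank and turns each run of blanks into one comma;
#  - the header row is kept once (from the first file) with an extra "source" column;
#  - every data row is prefixed with the name of the file it came from.
combined=$(cd "$tmpdir" && awk -v OFS=',' '
  FNR == 1 {
    if (NR == 1) { $1 = $1; print "source", $0 }
    next
  }
  { $1 = $1; print FILENAME, $0 }
' a.txt b.txt)

printf '%s\n' "$combined"
rm -rf "$tmpdir"
# source,YYYY,MM,DD
# a.txt,1991,10,1
# b.txt,1992,11,2
```

Whether this is faster than one sed call per file for 120 files of about 2 GB each would have to be measured; the sketch only shows that the conversion and the source tagging can share one process.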
