Using fread() to select rows and columns, the way read.csv.sql() does


Question

I know fread is relatively new, but it gives a big performance improvement. What I want to know is: can you select both rows and columns from the file you are reading, a bit like read.csv.sql does? I know the select option of fread lets you choose which columns to read, but what about reading only the rows that satisfy certain criteria?

For example, can something like below be implemented using fread?

read.csv.sql(file, sql = "select V2, V4, V7, V8, V9, V10 from file where V5 == 'CE' and V10 >= 500", header = FALSE, sep = "|", eol = "\n")

If this is not possible yet, is it advisable to read all the data and then use subset etc. to arrive at the final result? Or would that defeat the purpose of using fread?

For reference, I have to read about 800 files, each containing about 100,000 rows and 10 columns. Any input is welcome.

Thanks.

Solution

It is not yet possible to select rows with fread() the way read.csv.sql() does. It is still better to read the entire data set (memory permitting) and then subset it according to your criteria. For a 200 MB file, fread() + subset() was about 4 times faster than read.csv.sql().

So, using @Arun's suggestion,

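library(data.table)
# files is a character vector of file paths; fn records the source file of each row
ans = rbindlist(lapply(files, function(x) fread(x)[, fn := x]))
subset(ans, V5 == "CE" & V10 >= 500)  # criteria taken from the question; substitute your own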

is better than the approach in the original question.
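For illustration, here is a minimal, self-contained sketch of that approach applied to the query from the question. It is a variation on the answer rather than the answer's exact code: it reads only the needed columns with fread's select argument and filters each file before binding, which may help keep peak memory down. The sample files, the helper name read_one and the column values are hypothetical and exist only to make the example runnable.

```r
library(data.table)

# Hypothetical sample data: two small pipe-delimited files without a header,
# standing in for the ~800 files mentioned in the question.
tmp_dir <- tempfile("fread_demo_")
dir.create(tmp_dir)
for (i in 1:2) {
  dt <- data.table(
    V1 = 1:6, V2 = letters[1:6], V3 = 0, V4 = i,
    V5 = rep(c("CE", "PE"), 3), V6 = 0, V7 = 1, V8 = 2, V9 = 3,
    V10 = c(100, 200, 500, 600, 700, 800)
  )
  fwrite(dt, file.path(tmp_dir, sprintf("part%d.txt", i)),
         sep = "|", col.names = FALSE)
}
files <- list.files(tmp_dir, full.names = TRUE)

# Columns needed for the result (V2, V4, V7:V10) plus V5, which is only
# needed for the row filter.
cols <- c(2, 4, 5, 7, 8, 9, 10)

# read_one is a hypothetical helper: read one file, keep matching rows,
# drop the filter-only column, and tag each row with its source file.
read_one <- function(path) {
  dt <- fread(path, sep = "|", header = FALSE,
              select = cols, col.names = paste0("V", cols))
  dt <- subset(dt, V5 == "CE" & V10 >= 500)
  dt[, V5 := NULL]
  dt[, fn := basename(path)]
  dt
}

ans <- rbindlist(lapply(files, read_one))
print(ans)
```

Whether filtering per file or once after rbindlist() is faster will depend on the data; with files of about 100,000 rows each, either variant should fit comfortably in memory.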
