Quickest way to read a subset of rows of a CSV

Problem Description

I have a 5 GB CSV with 2 million rows. The header is a row of comma-separated strings, and each data row is comma-separated doubles with no missing or corrupted data. It is rectangular.

My objective is to read a random 10% (with or without replacement, doesn't matter) of the rows into RAM as fast as possible. An example of a slow solution (but faster than read.csv) is to read in the whole matrix with fread and then keep a random 10% of the rows.

require(data.table)
X <- data.matrix(fread('/home/user/test.csv'))    # read the full file into a numeric matrix
X <- X[sample(nrow(X), round(nrow(X) / 10)), ]    # keep a random 10% of the rows

However, I'm looking for the fastest possible solution (this one is slow because I need to read the whole file first and then trim it afterwards).

The solution deserving of a bounty will give system.time() estimates of different alternatives.
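
For reference, a minimal sketch of what such a timing run could look like for the baseline approach above (the path is the one from the question; no actual timings are claimed here, since they depend entirely on the machine):

require(data.table)

# Time the baseline: read the full file, then keep a random ~10% of the rows.
system.time({
  X <- data.matrix(fread('/home/user/test.csv'))
  X <- X[sample(nrow(X), round(nrow(X) / 10)), ]
})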

Other:

  • I am using Linux
  • I don't need exactly 10% of the rows, just roughly 10%.

Recommended Answer

I think this should work pretty quickly, but let me know since I have not tried it with big data yet.

write.csv(iris,"iris.csv")

fread("shuf -n 5 iris.csv")

    V1  V2  V3  V4  V5         V6
1:  37 5.5 3.5 1.3 0.2     setosa
2:  88 6.3 2.3 4.4 1.3 versicolor
3:  84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1  virginica
5: 114 5.7 2.5 5.0 2.0  virginica

This takes a random sample of N=5 for the iris dataset.
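
As a side note, newer versions of data.table expose a cmd argument for running shell commands, so an equivalent spelling (an assumption on my part, not part of the original answer) would be:

fread(cmd = "shuf -n 5 iris.csv")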

To avoid the chance of using the header row again, this might be a useful modification:

fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)
