Quickest way to read a subset of rows of a CSV
Question
I have a 5GB CSV with 2 million rows. The header is a line of comma-separated strings, and each row is a line of comma-separated doubles, with no missing or corrupted data. The file is rectangular.
My objective is to read a random 10% of the rows (with or without replacement, it doesn't matter) into RAM as fast as possible. An example of a slow solution (though faster than read.csv) is to read the whole matrix with fread and then keep a random 10% of the rows.
library(data.table)
X <- data.matrix(fread('/home/user/test.csv'))  # reads the full data matrix
X <- X[sample(nrow(X), round(nrow(X) / 10)), ]  # keep a random 10% of the rows
However, I'm looking for the fastest possible solution. The approach above is slow because it reads the whole file first and only trims it afterwards.
The solution deserving of the bounty will give system.time() estimates for the different alternatives.
Other notes:
- I am on Linux.
- I don't need exactly 10% of the rows, just approximately 10%.
Recommended answer
I think this should work pretty quickly, but let me know, since I have not tried it with big data yet.
write.csv(iris, "iris.csv")   # also writes a header line and a row-name column
fread("shuf -n 5 iris.csv")   # read 5 randomly chosen lines of the file
V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica
This takes a random sample of N=5 lines from the iris dataset.
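To scale this from N=5 to roughly 10% of a large file, the sample size can be derived from the line count. The following is only a minimal sketch, assuming GNU coreutils (`seq`, `wc`, `shuf`) are available; `demo.csv` and `demo_sample.csv` are hypothetical file names, and the header-less demo file is generated only so the example is self-contained.

```shell
# Build a hypothetical 1000-row, header-less CSV so the example is self-contained.
seq 1 1000 | awk '{ print $1 "," $1 * 2 }' > demo.csv

# Count the rows, then ask shuf for one tenth of them (integer division).
total=$(wc -l < demo.csv)
n=$(( total / 10 ))

# shuf -n samples n lines without replacement.
shuf -n "$n" demo.csv > demo_sample.csv
wc -l < demo_sample.csv
```

Counting lines costs an extra pass over the file. When the row count is already known, as with the 2 million rows in the question, the count can be skipped and the size passed directly (for example `shuf -n 200000`).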
To avoid the chance of the header row ending up in the sample, this might be a useful modification:
fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)
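Since only approximately 10% of the rows is needed, another option is to keep each data row independently with probability 0.1 in a single streaming pass. Note that this is a different technique from `shuf` (Bernoulli sampling with `awk`) and has not been timed against the approaches above. The sketch below assumes a POSIX `awk`; `demo_header.csv` and `demo_bernoulli.csv` are hypothetical file names.

```shell
# Hypothetical demo file: one header line plus 10000 numeric rows.
{ echo "a,b"; seq 1 10000 | awk '{ print $1 "," $1 / 2 }'; } > demo_header.csv

# Always keep line 1 (the header); keep every other line with probability 0.1.
# srand() seeds from the clock, so each run draws a different sample.
awk 'BEGIN { srand() } NR == 1 || rand() < 0.1' demo_header.csv > demo_bernoulli.csv

# Show the retained header line.
head -n 1 demo_bernoulli.csv
```

The number of rows kept varies from run to run around 10% of the total. Because the header line is retained, the `awk` command should be usable with fread without `header=FALSE`, in the same way as the `shuf` commands above.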