Importing and extracting a random sample from a large .CSV in R
Question
I'm doing some analysis in R where I need to work with some large datasets (10-20 GB each, stored as .csv files, which I currently read with the read.csv function).
Because I also need to merge the large .csv files with other data frames and transform them, I don't have the computing power or memory to import an entire file.
Does anyone know of a way to import only a random percentage of the rows of a csv?
I have seen examples where people import the entire file and then use a separate function to create another data frame that is a sample of the original. However, I am hoping for something less memory-intensive.
Recommended answer
I don't think there is a good R tool for reading a random subset of a file's lines (perhaps it could be added as an extension to read.table, or to fread from the data.table package).
Using perl, you can do this easily. For example, to read a random 1% of the lines of your file, you can do this:
xx <- system(paste("perl -ne 'print if (rand() < .01)'", shQuote(big_file)), intern = TRUE)
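Here I am calling perl from R using system. With intern = TRUE, the output is captured as a character vector, so xx now contains only about 1% of the lines of your file, one line per element. Each line is kept independently with probability 0.01, so the exact count varies from run to run, and the header line is sampled like any other line.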
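Since xx holds raw text lines rather than a data frame, a further parsing step is needed before merging. The following is a minimal sketch of one way to do that, not part of the original answer: the helper name lines_to_df is hypothetical, and it assumes the header is read separately (for example with readLines(big_file, n = 1)) and that no CSV field contains an embedded newline, since line-based sampling would split such records.

```r
# Hypothetical helper: turn sampled raw lines into a data frame.
# `header_line` is the first line of the CSV; `sampled` is a character
# vector such as the `xx` returned by system(..., intern = TRUE).
lines_to_df <- function(header_line, sampled) {
  # Drop the header if the random draw happened to include it
  # (assumes no data row is identical to the header line)
  sampled <- sampled[sampled != header_line]
  # read.csv() can parse in-memory text instead of a file path
  read.csv(text = c(header_line, sampled), stringsAsFactors = FALSE)
}

# Made-up lines standing in for the Perl output
header_line <- "id,value"
xx <- c("id,value", "3,0.5", "7,1.25")
df <- lines_to_df(header_line, xx)
```

The resulting df can then be merged with other data frames in the usual way.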
You can wrap all this in a function:
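# Return roughly a fraction `percent` (e.g. 0.01 for 1%) of the lines of big_file
read_partial_rand <- function(big_file, percent) {
  # Perl prints each line with probability `percent`
  cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
  # Quote the path so that file names containing spaces still work
  cmd <- paste(cmd, shQuote(big_file))
  # intern = TRUE captures the output as a character vector of lines
  system(cmd, intern = TRUE)
}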
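If perl is not available, the same line-by-line sampling idea can be sketched in base R by reading the file in chunks through a connection. This is an illustrative alternative rather than the answer's method: the function name sample_csv_lines and the chunk_size default are assumptions, it is probably slower than the Perl one-liner on files of 10-20 GB, and it again assumes no field contains an embedded newline.

```r
# Hypothetical base-R alternative: keep each data line with probability `p`,
# reading the file in chunks so it never has to fit in memory all at once.
sample_csv_lines <- function(path, p, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)        # always keep the header line
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break        # end of file reached
    # Independent draw per line, mirroring `rand() < p` in the Perl version
    kept <- c(kept, chunk[runif(length(chunk)) < p])
  }
  read.csv(text = c(header, kept), stringsAsFactors = FALSE)
}

# Small demonstration on a temporary file with made-up data
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", paste(1:50, (1:50) / 2, sep = ",")), tmp)
all_rows <- sample_csv_lines(tmp, p = 1, chunk_size = 7L)   # keeps every row
no_rows  <- sample_csv_lines(tmp, p = 0, chunk_size = 7L)   # keeps no rows
some_rows <- sample_csv_lines(tmp, p = 0.2)                 # about 20% of rows
```

Unlike the Perl version, this sketch never samples the header away and returns a parsed data frame directly.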