How to read specific rows of a CSV file with the fread function

Question

I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by a binary vector V of length 10 million, which assumes value 0 if I don't want to read the row and 1 if I do want to read the row.

How do I get the io function fread from the data.table package to do this? I ask because fread is so so fast compared to all other io approaches.

The best answer to a related question, Reading specific rows of large matrix data file (http://stackoverflow.com/questions/19513191/reading-specific-rows-of-large-matrix-data-file?rq=1), gives the following solution:

read.csv(pipe(paste0("sed -n '", paste0(c(1, which(V == 1) + 1), collapse = "p; "), "p' C:/Data/target.csv", collapse = "")), header = TRUE)

where C:/Data/target.csv is the large CSV file and V is the vector of 0 or 1.

However I have noticed that this is orders of magnitude slower than simply using fread on the entire matrix, even if the V will only be equal to 1 for a small subset of the total number of rows.

Thus, since fread on the whole matrix will dominate the above solution, how do I combine fread (and specifically fread) with row sampling?

This is not a duplicate because it is only about the function fread.

Here is my problem set-up:

 #load data.table up front, since fread is used below
 library(data.table)
 #create csv
 csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
 #my csv has a header:
 colnames(csv) <- LETTERS[1:5]
 #save csv
 write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
 #create vector of 0s and 1s that I want to read the CSV from
 read_vec <- rep(0,50)
 read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
 #the following is the effect that I want, but I want an efficient approach to it:
 csv <- read.csv("/home/user/test_csv.csv") #inefficient!
 csv <- csv[which(read_vec==1),] #inefficient!
 #the alternative approach, too slow when scaled up!
 #(fread does not accept a connection such as pipe(); recent data.table versions take the shell command via cmd=)
 csv <- fread( cmd = paste0("sed -n '" , paste0( c( 1 , which( read_vec == 1 ) + 1 ) , collapse = "p; " ) , "p' /home/user/test_csv.csv" ) , header=TRUE)
 #the fastest approach yet still not optimal because it needs to read all rows
 csv <- data.matrix(fread('/home/user/test_csv.csv'))
 csv <- csv[which(read_vec==1),]


Recommended answer

This approach takes a vector v (corresponding to your read_vec), identifies sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the result together.

If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)) then there will be few calls to fread(...) and you may see a significant improvement.
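To illustrate how a 0/1 vector is turned into blocks of consecutive rows, here is a minimal sketch on a toy vector. The names `toy_v`, `toy_runs`, `toy_start` and `toy_len` are made up for this illustration; only base R's `rle()` and `cumsum()` are used.

```r
# Toy selection vector: rows 1-2, row 5 and rows 7-9 are wanted
toy_v <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)

# Run-length encoding: consecutive runs of TRUE / FALSE and their lengths
toy_runs <- rle(toy_v)

# Position just before each run starts, kept only for the TRUE runs, plus 1
toy_start <- c(0, cumsum(toy_runs$lengths))[which(toy_runs$values)] + 1
# Number of rows in each TRUE run
toy_len   <- toy_runs$lengths[which(toy_runs$values)]

data.frame(start = toy_start, length = toy_len)
# blocks: start 1 / length 2, start 5 / length 1, start 7 / length 3
```

Each row of this table corresponds to one call to fread(...), so the fewer and longer the blocks, the fewer times the file has to be opened.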

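# create sample dataset
set.seed(1)
m   <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
# row.names=FALSE avoids writing an extra, unnamed row-name column
write.csv(csv,"test.csv",row.names=FALSE)
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, T means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)

# runs: run-length encoding of v (named runs so that base::seq is not masked)
runs <- rle(v)
idx  <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start=idx, length=runs$lengths[which(runs$values)])

library(data.table)
# skip=start skips the header line plus the (start-1) data rows before the block
result <- rbindlist(lapply(seq_len(nrow(indx)), function(i)
  fread("test.csv", nrows=indx$length[i], skip=indx$start[i], header=FALSE)))
# restore the column names from the header line
setnames(result, names(fread("test.csv", nrows=0)))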
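The same idea can be wrapped in a small function and checked against the "read everything, then subset" approach. This is only a sketch: `fread_rows` is a hypothetical helper name (it is not part of data.table), and it assumes the file has exactly one header line, that `v` has one entry per data row, and that at least one row is selected.

```r
library(data.table)

# Hypothetical helper: read only the rows of `file` flagged in `v`
# (one header line assumed; v is logical or 0/1, one entry per data row)
fread_rows <- function(file, v) {
  runs   <- rle(as.logical(v))
  starts <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
  lens   <- runs$lengths[which(runs$values)]
  # one fread call per block of consecutive rows; skip = start also skips the header
  blocks <- lapply(seq_along(starts), function(i)
    fread(file, skip = starts[i], nrows = lens[i], header = FALSE))
  out <- rbindlist(blocks)
  # take the column names from the header line
  setnames(out, names(fread(file, nrows = 0)))
  out
}

# Small self-check on a temporary file
set.seed(1)
tmp <- tempfile(fileext = ".csv")
dat <- data.frame(x = 1:1000, matrix(rnorm(5000), ncol = 5))
write.csv(dat, tmp, row.names = FALSE)
v <- 1:1000 %in% c(1:50, 53, 65, 100:200, 900:1000)

partial <- fread_rows(tmp, v)
full    <- fread(tmp)[which(v)]   # read everything, then subset
all.equal(partial, full)
```

Whether this beats a single full fread depends on how many blocks there are, since every block means another pass over the start of the file; timing both on the real data is the only reliable way to tell.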
