Filtering multiple csv files while importing into data frame
Question
I have a large number of csv files that I want to read into R. All the column headings in the csvs are the same. But I want to import only those rows from each file into the data frame for which a variable is within a given range (above min threshold & below max threshold), e.g.
v1 v2 v3
1 xq 2
2 cw 4
3 ve 5
4 br 7
Filtering for v3 (v3>2 & v3<7) should result in:
v1 v2 v3
1 cw 4
2 ve 5
So far I import all the data from all csvs into one data frame and then do the filtering:
#Read the data files
fileNames <- list.files(path = workDir)
mergedFiles <- do.call("rbind", sapply(fileNames, read.csv, simplify = FALSE))
fileID <- row.names(mergedFiles)
fileID <- gsub(".csv.*", "", fileID)
#Combining data with file IDs
combFiles=cbind(fileID, mergedFiles)
#Filtering the data according to criteria
resultFile <- combFiles[combFiles$v3 > min & combFiles$v3 < max, ]
I would rather apply the filter while importing each single csv file into the data frame. I assume a for loop would be the best way of doing it, but I am not sure how.
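For reference, filtering during import can also be done without an explicit for loop by mapping a small reader function over the file names. This is a minimal base-R sketch; the two sample CSV files and the threshold names minV/maxV are fabricated here purely to make the example self-contained and runnable:

```r
# Fabricate a workDir with two small CSVs (illustrative data only)
workDir <- tempdir()
write.csv(data.frame(v1 = 1:2, v2 = c("xq", "cw"), v3 = c(2, 4)),
          file.path(workDir, "a.csv"), row.names = FALSE)
write.csv(data.frame(v1 = 3:4, v2 = c("ve", "br"), v3 = c(5, 7)),
          file.path(workDir, "b.csv"), row.names = FALSE)

minV <- 2
maxV <- 7

# Read one file and keep only rows inside the open interval (minV, maxV)
readFiltered <- function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d[d$v3 > minV & d$v3 < maxV, ]
}

fileNames <- list.files(path = workDir, pattern = "\\.csv$", full.names = TRUE)
resultFile <- do.call(rbind, lapply(fileNames, readFiltered))
```

Because each file is subset immediately after being read, only the matching rows are ever held together in memory, which matters when the combined files are large.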
I would appreciate any suggestion.
Edit
After testing the suggestion from mnel, which worked, I ended up with a different solution:
fileNames = list.files(path = workDir)
mzList = list()
for(i in seq_along(fileNames)){
tempData = read.csv(fileNames[i])
mz.idx = which(tempData[ ,1] > minMZ & tempData[ ,1] < maxMZ)
mz1 = tempData[mz.idx, ]
mzList[[i]] = data.frame(mz1, filename = rep(fileNames[i], length(mz.idx)))
}
resultFile = do.call("rbind", mzList)
Thanks for all the suggestions!
Solution

Here is an approach using data.table, which will allow you to use fread (which is faster than read.csv) and rbindlist, a superfast implementation of do.call(rbind, list(..)) that is perfect for this situation. It also has a between function.
library(data.table)
fileNames <- list.files(path = workDir)
alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
xx <- fread(x, sep = ',')
xx[, fileID := gsub(".csv.*", "", x)]
xx[between(v3, lower=min, upper = max, incbounds = FALSE)]
}, min = 2, max = 3))
If the individual files are large and v3 always holds integer values, it might be worth setting v3 as a key and then using a binary search; it may also be quicker to import everything and then run the filtering.
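The keyed binary-search idea can be sketched as follows. This is a minimal illustration on fabricated data (the values mirror the example table above): because v3 is integer-valued, the open interval (2, 7) can be enumerated as 3:6, so a keyed join retrieves the matching rows via binary search instead of a full vector scan.

```r
library(data.table)

# Illustrative stand-in for the merged CSV data
dt <- data.table(v1 = 1:4,
                 v2 = c("xq", "cw", "ve", "br"),
                 v3 = c(2L, 4L, 5L, 7L))

# Setting v3 as the key sorts the table on v3 and enables binary search
setkey(dt, v3)

# Keyed join on the integer values strictly between 2 and 7;
# nomatch = 0L drops interval values that are absent from the data
result <- dt[J(3:6), nomatch = 0L]
```

For exact-value lookups on large keyed tables this scales much better than a logical-vector subset, though for a one-off range filter on modest data the plain dt[v3 > 2 & v3 < 7] form is simpler and usually fast enough.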