Filtering multiple csv files while importing into data frame


Problem Description


I have a large number of csv files that I want to read into R. All the column headings in the csvs are the same. But I want to import only those rows from each file into the data frame for which a variable is within a given range (above min threshold & below max threshold), e.g.

   v1   v2   v3
1  x    q    2
2  c    w    4
3  v    e    5
4  b    r    7

Filtering for v3 (v3>2 & v3<7) should result in:

   v1   v2   v3
1  c    w    4
2  v    e    5
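
On a single data frame the row filter itself is just a logical subset; a minimal sketch with the example values above:

#Example data from the table above
df <- data.frame(v1 = c("x", "c", "v", "b"),
                 v2 = c("q", "w", "e", "r"),
                 v3 = c(2, 4, 5, 7))
#Keep rows where v3 lies strictly between the thresholds
subset(df, v3 > 2 & v3 < 7)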

So far I import all the data from all csvs into one data frame and then do the filtering:

#Read the data files
fileNames <- list.files(path = workDir)
mergedFiles <- do.call("rbind", sapply(fileNames, read.csv, simplify = FALSE))
#Row names of the merged frame carry the source file names; strip the ".csv..." suffix
fileID <- row.names(mergedFiles)
fileID <- gsub(".csv.*", "", fileID)
#Combining data with file IDs
combFiles <- cbind(fileID, mergedFiles)
#Filtering the data according to criteria (min and max hold the chosen thresholds)
resultFile <- combFiles[combFiles$v3 > min & combFiles$v3 < max, ]

I would rather apply the filter while importing each single csv file into the data frame. I assume a for loop would be the best way of doing it, but I am not sure how. I would appreciate any suggestion.

Edit

After testing the suggestion from mnel, which worked, I ended up with a different solution:

fileNames = list.files(path = workDir)
mzList = list()
for (i in 1:length(fileNames)) {
  tempData = read.csv(fileNames[i])
  #Keep only rows whose first column lies between the thresholds
  mz.idx = which(tempData[, 1] > minMZ & tempData[, 1] < maxMZ)
  mz1 = tempData[mz.idx, ]
  mzList[[i]] = data.frame(mz1, filename = rep(fileNames[i], length(mz.idx)))
}
resultFile = do.call("rbind", mzList)
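
The same filter-on-import idea can also be written without an explicit loop; the following is only a sketch, assuming workDir, minMZ and maxMZ are defined as above:

#Sketch only: read each file, filter on the first column, tag rows with the file name
fileNames <- list.files(path = workDir, full.names = TRUE)
readFiltered <- function(f) {
  tempData <- read.csv(f)
  keep <- tempData[, 1] > minMZ & tempData[, 1] < maxMZ
  data.frame(tempData[keep, ], filename = rep(basename(f), sum(keep)))
}
resultFile <- do.call("rbind", lapply(fileNames, readFiltered))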

Thanks for all the suggestions!

Solution

Here is an approach using data.table, which lets you use fread (which is faster than read.csv) and rbindlist, a superfast implementation of do.call(rbind, list(..)) that is perfect for this situation. It also has a between function for the range filter:

library(data.table)
fileNames <- list.files(path = workDir)
alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
  #fread is a faster drop-in for read.csv
  xx <- fread(x, sep = ',')
  #Tag each row with the source file name, ".csv" suffix stripped
  xx[, fileID := gsub(".csv.*", "", x)]
  #Keep rows with min < v3 < max (exclusive bounds)
  xx[between(v3, lower = min, upper = max, incbounds = FALSE)]
  }, min = 2, max = 3))

If the individual files are large and v3 always takes integer values, it might be worth setting v3 as a key and then using a binary search; it may also be quicker to import everything and then run the filtering.
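
A sketch of that alternative, assuming v3 is integer-valued and that alldata was imported without the between() filter (the 2 and 7 bounds are the example thresholds from the question):

#Sketch only: import everything, then filter via a keyed binary-search join
setkey(alldata, v3)
#For the exclusive bounds v3 > 2 & v3 < 7, the matching integer keys are 3:6;
#nomatch = 0L drops key values that do not occur in the data
resultFile <- alldata[J(3:6), nomatch = 0L]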
