How to write a tryCatch() function when extracting data from multiple files?


Problem description

I am currently working with thousands of files (with a .MOD extension) and want to extract specific information from each of them. This information is then collected into one Excel sheet so that each row represents the information extracted from one .MOD file. I have managed to do this.

However, about 10-20 files (out of tens of thousands) do not contain the information in the format I expect, and this throws an error. I cannot dig through all the files manually, nor subset them each time to find out which file is throwing the error. I therefore want to include a tryCatch() function so that the script keeps running instead of stopping. For the files that give an error, I simply want the values in those specific cells to be replaced by "Error". Can anyone help me do that?

This is how I want my final Excel output to look:

ID  COL1    COL2    COL3    COL4    COL5    COL6    COL7    COL8
Sample1 9-5-2014    10:42:41    600 1207    3   2   62  30
Sample2 8-1-2013    08:44:50    654 1873    1   7   60  45
Sample3 2-3-2013    14:47:40    767 1645    1   18  66  37
Sample4 8-2-2013    08:50:45    727 1500    1   8   68  45
Sample5 4-1-2013    13:08:49    Error   Error   Error   Error   Error   Error
Sample6 1-2-2013    13:08:47    720 1433    1   16  60  51
Sample7 3-4-2013    13:59:04    610 1343    2   13  66  32

Here is my code (along with the error):

AR.MOD.files <- list.files(pattern = "AR.MOD|ar.MOD")
for (fileName in AR.MOD.files) {
  AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE)
  AR.MOD.subset1 <- AR.MOD[c(1), 3:4]
  names(AR.MOD.subset1) <- c("COL1", "COL2")
  AR.MOD.subset2 <- AR.MOD[c(3), 3:8]
  names(AR.MOD.subset2) <- c("COL3", "COL4", "COL5", "COL6", "COL7", "COL8")
  AR.MOD.final <- merge(AR.MOD.subset1, AR.MOD.subset2)
  ID <- basename(fileName)
  AR.MOD.final <- merge(ID, AR.MOD.final)
  colnames(AR.MOD.final)[colnames(AR.MOD.final) == "x"] <- "ID"
  if (match(fileName, AR.MOD.files) == 1) {
    output.AR.MOD <- AR.MOD.final
  } else {
    output.AR.MOD <- rbind(output.AR.MOD, AR.MOD.final)
  }
}
# Error in `[.data.frame`(AR.MOD, c(3), 3:8) : undefined columns selected
output.AR.MOD$ID <- gsub("AR.MOD", "", paste(output.AR.MOD$ID))
output.AR.MOD$ID <- gsub("ar.MOD", "", paste(output.AR.MOD$ID))
print(output.AR.MOD)

Here are 2 example files:

> AR.MOD <- read.table("Sample1ar.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2        V3       V4   V5    V6   V7    V8
1 Case  1 23-3-2013 14:47:40                      
2  Run NA                                         
3    R  1    767,96  1647,72 1,78 18,88 0,66 37,33

> AR.MOD <- read.table("Sample2AR.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2       V3       V4   V5   V6   V7    V8
1 Case  1 9-5-2014 10:42:41                     
2  Run NA                                       
3    R  1   566,47  1207,22 3,05 2,95 0,62 30,00

It works with the above 2 examples. However, if one of the columns is missing, as in the following file, it throws an error.

> AR.MOD <- read.table("Sample3AR.MOD", header = FALSE, fill = TRUE)
> AR.MOD
    V1 V2        V3      V4   V5   V6   V7
1 Case  1 28-1-2013 8:44:50                     
2  Run NA                                       
3    R  1    783,76 1873,70 1,34 7,48 0,60

At this point I am not sure which file the error comes from, so the third sample above is a dummy example. I am not able to attach files directly here, which is why I read the files in and show the output instead.
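One way to find out which files are affected, before any extraction is attempted, is to read each file and compare its column count with the expected one. The sketch below is only an illustration: `find_short_files` and its `expected_cols` argument are hypothetical names, and the two temporary files are made up to mimic the samples above.

```r
# Hypothetical helper: return the files whose parsed table has fewer
# columns than expected (8 in the samples shown above).
find_short_files <- function(files, expected_cols = 8) {
  n_cols <- vapply(files, function(f) {
    ncol(read.table(f, header = FALSE, fill = TRUE))
  }, integer(1))
  names(n_cols)[n_cols < expected_cols]
}

# Demo with two temporary files that imitate the samples (assumed content)
tmp <- tempfile()
dir.create(tmp)
good <- file.path(tmp, "Sample1ar.MOD")
bad  <- file.path(tmp, "Sample3AR.MOD")
writeLines(c("Case 1 23-3-2013 14:47:40",
             "Run NA",
             "R 1 767,96 1647,72 1,78 18,88 0,66 37,33"), good)
writeLines(c("Case 1 28-1-2013 8:44:50",
             "Run NA",
             "R 1 783,76 1873,70 1,34 7,48 0,60"), bad)

short <- find_short_files(c(good, bad))
print(basename(short))
```

This reads every file twice if the extraction runs afterwards, so for tens of thousands of files it may be preferable to do the check inside the extraction loop itself.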

Recommended answer

I'd echo the lapply solution: make the tables as individual list elements and then handle the combination afterwards. Here is an example using the data.table package that fills in NA wherever data is missing:

# # for installing:
# install.packages("data.table")
library(data.table)

# generate tables with uneven columns
set.seed(1)
tables <- lapply(1:10, function(i){
  ncols <- sample(1:5, 1, 1)
  out <- as.data.frame(matrix(runif(ncols), nrow=1, ncol=ncols))
})

# you can use rbindlist with fill=TRUE to fill the bad values with NA
output <- as.data.frame(rbindlist(tables, fill=TRUE))

I can't be certain this will work right off the bat, but give it a try:

# # for installing:
# install.packages("data.table")
library(data.table)

# Set this to what you expect max to be 
ncol_total <- 9
tables <- lapply(AR.MOD.files, function(fileName){
  AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE)
  AR.MOD.subset1 <- AR.MOD[c(1), 3:4]
  names(AR.MOD.subset1) <- c("COL1", "COL2")
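  # take columns 3:8 of row 3, or fewer if trailing columns are missing,
  # so that the subsetting itself does not throw "undefined columns selected"
  value_cols <- 3:min(8, ncol(AR.MOD))
  AR.MOD.subset2 <- AR.MOD[c(3), value_cols, drop = FALSE]
  names(AR.MOD.subset2) <- paste0("COL", value_cols)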
  AR.MOD.final <- merge(AR.MOD.subset1, AR.MOD.subset2)
  ID <- basename(fileName)
  AR.MOD.final <- merge (ID, AR.MOD.final)
  colnames(AR.MOD.final)[colnames(AR.MOD.final)=="x"] <- "ID"

  # add in missing data
  ncol_file <- ncol(AR.MOD.final)
  missing <- ncol_total - ncol_file
  if(missing > 0){
    new_data <- as.data.frame(matrix("Error", nrow=nrow(AR.MOD.final), ncol=missing))
    AR.MOD.final <- cbind(AR.MOD.final, new_data)
  }

  AR.MOD.final
})

# this will likely mess up the column names. It's better to know what these
# are and assign them afterwards, as long as the tables are all in the same order
output <- as.data.frame(rbindlist(tables, use.names = FALSE))
names(output) <- c("ID", "COL1", "COL2", "COL3", "COL4", "COL5", "COL6", "COL7",
                   "COL8")

# continuing on
output$ID <- gsub("AR.MOD", "", paste(output$ID))
output$ID <- gsub("ar.MOD", "", paste(output$ID))
print(output)
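Since the question asks specifically about tryCatch(), here is a minimal base R sketch of that route, not the answerer's method. It wraps only the step that fails (selecting columns 3:8) in tryCatch() and substitutes "Error" for COL3 to COL8 when it does, which keeps the date and time as in the desired output. `extract_row` is a hypothetical helper name, the temporary files imitate the samples from the question, and the sketch assumes the first line of every file is still readable.

```r
# Hypothetical helper: build one output row per file. If columns 3:8 cannot
# be selected, the error handler returns "Error" for COL3..COL8 instead.
extract_row <- function(fileName) {
  value_cols <- paste0("COL", 3:8)
  # read everything as text so values and "Error" can share a column
  AR.MOD <- read.table(fileName, header = FALSE, fill = TRUE,
                       colClasses = "character")
  head_part <- data.frame(ID = basename(fileName),
                          COL1 = AR.MOD[1, 3], COL2 = AR.MOD[1, 4],
                          stringsAsFactors = FALSE)
  values <- tryCatch(
    {
      v <- AR.MOD[3, 3:8]  # throws if a column is missing
      names(v) <- value_cols
      v
    },
    error = function(e) {
      # report the offending file, then return a placeholder row
      message("Problem in ", basename(fileName), ": ", conditionMessage(e))
      as.data.frame(as.list(setNames(rep("Error", 6), value_cols)),
                    stringsAsFactors = FALSE)
    }
  )
  cbind(head_part, values)
}

# Demo with temporary files that imitate the samples (assumed content)
tmp <- tempfile()
dir.create(tmp)
writeLines(c("Case 1 9-5-2014 10:42:41",
             "Run NA",
             "R 1 566,47 1207,22 3,05 2,95 0,62 30,00"),
           file.path(tmp, "Sample2AR.MOD"))
writeLines(c("Case 1 28-1-2013 8:44:50",
             "Run NA",
             "R 1 783,76 1873,70 1,34 7,48 0,60"),
           file.path(tmp, "Sample3AR.MOD"))

files <- list.files(tmp, pattern = "AR\\.MOD$", ignore.case = TRUE,
                    full.names = TRUE)
output <- do.call(rbind, lapply(files, extract_row))
rownames(output) <- NULL
output$ID <- sub("AR\\.MOD$", "", output$ID, ignore.case = TRUE)
print(output)
```

The message() call prints the name of each failing file, which also addresses the problem of not knowing which file triggered the error. With tens of thousands of files, data.table's rbindlist() may be a faster replacement for do.call(rbind, ...).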
