Extracting one text file from multiple zip archives in R


Question

I am trying to extract one text file from each of the zip files in a folder, and then combine those text files into one data frame.

The folder contains multiple zip files:

pf_0915.zip
pf_0914.zip
pf_0913.zip
.....

Each zip file contains multiple text files. I am only interested in the one called abc.txt, a fixed-width file without a header. I have already set up a read for this file using read_fwf. Since all the extracted text files have the same name, it might be better to rename them according to the name of their archive, i.e. the abc.txt from pf_0915.zip could be called abc_0915.txt. Once they have all been read, they should be combined into one large file called abcCombined.txt.

Alternatively, each new abc.txt could be appended to abcCombined.txt as it is read.

I have tried various versions of unzip() and unz() without much success, and those attempts did not loop through all the zip files. Finally, the directory contains many zip files. Is there a way to read only some of them by pattern matching, as with grep? For example, I would like to read only the September files, those matching .._09...txt.

Any hints would be appreciated.

Recommended answer

The following:

  1. Creates a vector of the files in a directory
  2. Uses the list parameter of unzip() to see the metadata for the contents
  3. Builds a regular expression to find only the target file (I did that in case your use case generalizes to a broader pattern)
  4. Tests whether any of the files in each archive meet your criteria
  5. Keeps only those archives in a resulting vector
  6. Iterates over that vector and
    • Extracts only the target file into a temporary directory
    • Reads it into a data.frame
    • Ultimately binds the individual data.frames into one big one

You can write out the resulting combined data.frame however you wish.

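library(purrr)  # keep(), map_df() and %>%; map_df() also needs dplyr installed

target_dir <- "so"
extract_file <- "abc.txt"

# List every file in the directory, keep the archives that contain the
# target file, then extract, read and row-bind the results
list.files(target_dir, full.names=TRUE) %>% 
  keep(~any(grepl(sprintf("^%s$", extract_file), unzip(., list=TRUE)$Name))) %>% 
  map_df(function(x) {
    td <- tempdir()
    read.fwf(unzip(x, extract_file, exdir=td), widths=c(4,1,4,2))
  }) -> combined_df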
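
The question also asks how to read only some of the archives, such as the September ones. A minimal sketch of that filtering step, assuming the archive names follow a pf_MMYY.zip pattern (inferred from the example names in the question, not confirmed); the `zip_files` vector here is a made-up listing standing in for the output of list.files():

```r
# Hypothetical listing, standing in for list.files("so", full.names = TRUE)
zip_files <- c("so/pf_0915.zip", "so/pf_0914.zip", "so/pf_0813.zip", "so/readme.txt")

# Keep archives named pf_09YY.zip: "09" is the assumed month, then any two digits
september <- zip_files[grepl("^pf_09[0-9]{2}\\.zip$", basename(zip_files))]
print(september)
```

On a real directory the same regular expression can be passed straight to list.files() through its pattern argument, e.g. list.files(target_dir, pattern = "^pf_09[0-9]{2}\\.zip$", full.names = TRUE), and the resulting vector can then be fed into the rest of the pipeline.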

The version below expands some of the shortcuts in the first pipeline, using a named function in place of the inline formula:

# Returns TRUE if the archive at zip_path contains a file matching `name`
only_files_with_this_name <- function(zip_path, name) {
  zip_contents <- unzip(zip_path, list=TRUE)
  look_for <- sprintf("^%s$", name)
  any(grepl(look_for, zip_contents$Name))
}

list.files(target_dir, full.names=TRUE) %>% 
  keep(only_files_with_this_name, name=extract_file) %>% 
  map_df(function(x) {
    td <- tempdir()
    file_in_zip <- unzip(x, extract_file, exdir=td)
    df <- read.fwf(file_in_zip, widths=c(4,1,4,2))
    unlink(file_in_zip)  # remove the extracted copy once it has been read
    df                   # return the data.frame, not unlink()'s status code
  }) -> combined_df
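
The question also mentions labelling each extract by its archive and writing everything to abcCombined.txt. Below is a rough base-R sketch of one way to do that. It assumes every archive in the directory contains abc.txt, that names follow pf_MMYY.zip, and that the column widths are the ones used above. `archive_tag` and `read_abc_from_zip` are hypothetical helpers introduced only for illustration, and the tab-separated output format is an arbitrary choice.

```r
# Hypothetical helper: "so/pf_0915.zip" -> "0915"
archive_tag <- function(zip_path) {
  sub("^pf_(.*)\\.zip$", "\\1", basename(zip_path))
}

# Hypothetical helper: extract one file from one archive, read it, and
# record which archive it came from in an extra column
read_abc_from_zip <- function(zip_path, name = "abc.txt", widths = c(4, 1, 4, 2)) {
  td <- tempfile("unz_")                 # fresh directory per archive
  dir.create(td)
  on.exit(unlink(td, recursive = TRUE))  # clean up the extracted copy on exit
  extracted <- unzip(zip_path, files = name, exdir = td)
  df <- read.fwf(extracted, widths = widths)
  df$archive <- archive_tag(zip_path)
  df
}

zip_files <- list.files("so", pattern = "\\.zip$", full.names = TRUE)
if (length(zip_files) > 0) {
  combined_df <- do.call(rbind, lapply(zip_files, read_abc_from_zip))
  write.table(combined_df, "abcCombined.txt",
              sep = "\t", quote = FALSE, row.names = FALSE)
}
```

Tagging rows with an `archive` column keeps the source of each record without having to rename the extracted files on disk. If some archives may lack abc.txt, filter them first with a check like only_files_with_this_name().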
