R: possible truncation of >= 4GB file

Question

I have a 370MB zip file and the content is a 4.2GB csv file.

I ran:

unzip("year2015.zip", exdir = "csv_folder")

and got this warning:

1: In unzip("year2015.zip", exdir = "csv_folder") :
  possible truncation of >= 4GB file

Have you experienced that before? How did you solve it?

Recommended answer

I agree with @Sixiang.Hu's answer, R's unzip() won't work reliably with files greater than 4GB.
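One way to avoid R's internal unzip code without leaving R is the `unzip` argument of `utils::unzip()`, which can point at an external program. The following is a minimal sketch, assuming a command-line `unzip` build that copes with archives of this size is installed and on the PATH; `unzip_external` is a hypothetical helper name, not part of R.

```r
# Hypothetical helper: extract with the system's unzip program instead of
# R's internal implementation. Assumes an external `unzip` is installed.
unzip_external <- function(zipfile, exdir = ".",
                           program = unname(Sys.which("unzip"))) {
  if (!file.exists(zipfile)) {
    stop("archive not found: ", zipfile)
  }
  if (!nzchar(program)) {
    stop("no external unzip program found on the PATH")
  }
  # `unzip =` selects the extraction method; R's default is "internal"
  utils::unzip(zipfile, exdir = exdir, unzip = program)
}

# Usage with the file from the question:
# unzip_external("year2015.zip", exdir = "csv_folder")
```

Whether this helps depends on the external program: it only moves the size limit from R's built-in code to whatever tool is installed.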

As for how to solve it: I've tried a few different tricks, and in my experience anything built on R's built-ins (almost) invariably identifies an end-of-file (EOF) marker before the actual end of the file.
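Because the symptom is an early end of file, one cheap sanity check after extraction is to compare the size on disk with the size you expect. This is a minimal sketch: `extracted_size_ok` is a hypothetical helper, and the expected size is assumed to come from a source you trust, such as `unzip -l` run in a shell or the data provider.

```r
# Hypothetical helper: TRUE when the file on disk has exactly the expected size.
extracted_size_ok <- function(path, expected_bytes) {
  actual <- file.size(path)  # NA if the file does not exist
  !is.na(actual) && actual == expected_bytes
}

# Small self-contained demonstration with a temporary 10-byte file
tmp <- tempfile(fileext = ".csv")
writeBin(as.raw(rep(65, 10)), tmp)

extracted_size_ok(tmp, 10)     # sizes match
extracted_size_ok(tmp, 4.2e9)  # sizes differ, as with a truncated 4.2GB extract
```

A size match does not prove the contents are intact, but a mismatch is a quick signal that the extraction stopped early.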

I deal with this issue in a set of files I process nightly. To handle it consistently and automatically, I wrote the function below to wrap the UNIX unzip. It is basically what you would do by calling unzip through system(), but it gives you a bit more flexibility in its behavior and lets you check for errors more systematically.

decompress_file <- function(directory, file, .file_cache = FALSE) {

    if (isTRUE(.file_cache)) {
       print("decompression skipped")
    } else {

      # Set working directory for decompression
      # simplifies unzip directory location behavior;
      # on.exit() restores it even if decompression fails
      wd <- setwd(directory)
      on.exit(setwd(wd), add = TRUE)

      # Run the system unzip and capture its stdout
      decompression <-
        system2("unzip",
                args = c("-o", # overwrite existing files without prompting
                         shQuote(file)),
                stdout = TRUE)

      # uncomment to delete archive once decompressed
      # file.remove(file)

      # Test for success criteria
      # change the search depending on
      # your implementation
      if (length(decompression) > 0 &&
          grepl("Warning message", tail(decompression, 1))) {
        print(decompression)
      }
    }
}

Notes:

The function does a few things, which I like and recommend:

  • uses system2 over system because the documentation says "system2 is a more portable and flexible interface than system"
  • separates the directory and file arguments, and moves the working directory to the directory argument; depending on your system, unzip (or your choice of decompression tool) gets really finicky about decompressing archives outside the working directory
    • it's not pure, but resetting the working directory is a nice step toward the function having fewer side effects
    • you can technically do it without this, but in my experience it's easier to make the function more verbose than to deal with generating filepaths and remembering unzip CLI flags
  • takes a .file_cache argument that skips decompression when it is TRUE
    • this comes in handy if you're testing a process that runs on the decompressed file, since 4GB+ files tend to take some time to decompress
  • an if + grepl check at the end looks for warnings in the stdout, and prints the stdout if it finds that expression
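As a complement to searching the captured text, the exit status of the command can be checked as well. When `system2()` is called with `stdout = TRUE` and the command exits with a non-zero status, R attaches that status to the returned character vector as a `"status"` attribute. A minimal sketch follows; `exit_status` is a hypothetical helper, and the captured output is simulated so the example runs without unzip installed.

```r
# Hypothetical helper: exit status of a command run with
# system2(..., stdout = TRUE). The "status" attribute is only present
# when the exit status was non-zero, so its absence is treated as 0.
exit_status <- function(output) {
  status <- attr(output, "status")
  if (is.null(status)) 0L else as.integer(status)
}

# Simulated captured output (placeholder text, not real unzip output)
ok     <- c("first line of output", "second line of output")
failed <- structure("some error text", status = 1L)

exit_status(ok)      # no attribute, so treated as success
exit_status(failed)  # non-zero status carried by the attribute
```

Inside a wrapper like decompress_file, a call such as `exit_status(decompression) != 0` could then trigger the same printing of stdout that the grepl check does.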
