Stream and unzip large csv file with ruby


Problem description

I have a problem where I need to download, unzip, and then process line by line a very large CSV file. I think it's useful to give you an idea how large the file is:

  • big_file.zip ~ 700 MB
  • big_file.csv ~ 23 GB

Here's what I would like to happen:

  • To not have to download the whole file before unzipping
  • To not have to unzip the whole file before parsing the csv lines
  • To not use very much memory/disk while doing all of this

I don't know if that's possible or not. Here's what I was thinking:

require 'open-uri'
require 'zip' # the rubyzip gem is loaded with require 'zip', not 'rubyzip'
require 'csv'

open('http://foo.bar/big_file.zip') do |zipped|
  Zip::InputStream.open(zipped) do |unzipped|
    # wait for the entry we care about to show up in the stream
    sleep 10 until (entry = unzipped.get_next_entry) && entry.name == 'big_file.csv'
    CSV.foreach(unzipped) do |row|
      # process the row, maybe write out to STDOUT or some file
    end
  end
end

Here are the problems that I know about:

  • open-uri reads the whole response and saves it into a Tempfile, which is no good with a file this size. I'd probably need to use Net::HTTP directly, but I'm not sure how to do that and still get an IO (see the sketch after this list).
  • I don't know how fast the download is going to be, or whether Zip::InputStream works the way I've shown it working. Can it unzip some of the file when it's not all there yet?
  • Will CSV.foreach work with rubyzip's InputStream? Does it behave enough like File that it will be able to parse out the rows? Will it freak out if it wants to read but the buffer is empty?
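
For reference, a minimal sketch of the chunked-download side with Net::HTTP might look like this; the host and path are the placeholders from the example above, and writing to stdout just stands in for a real consumer:

require 'net/http'

uri = URI('http://foo.bar/big_file.zip')
Net::HTTP.start(uri.host, uri.port) do |http|
  http.request_get(uri.path) do |response|
    # a block given to read_body streams the body in chunks instead of
    # buffering the whole response in memory
    response.read_body do |chunk|
      $stdout.write(chunk) # placeholder: hand each chunk to whatever consumes it
    end
  end
end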

I don't know if any of this is the right approach. Maybe some EventMachine solution would be better (I've never used EventMachine before, but if it works better for something like this, I'm all for it).

Recommended answer

It's been a while since I posted this question, and in case anyone else comes across it I thought it might be worth sharing what I found.

  1. For the number of rows I was dealing with, Ruby's standard library CSV was too slow. My csv file was simple enough that I didn't need all that machinery for quoted strings or type coercion anyway. It was much easier just to use IO#gets and then split the line on commas (see the sketch after this list).
  2. I was unable to stream the entire thing from http through a Zip::InputStream to some IO containing the csv data. This is because the zip file structure puts the End of Central Directory (EOCD) record at the end of the file, and that record is needed in order to extract a file, so streaming the zip straight from http doesn't seem like it would work.
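
A minimal sketch of the plain IO#gets parsing described in point 1, assuming the fields never contain quoted commas (the path is a placeholder):

File.open('/path/to/big_file.csv') do |io|
  while (line = io.gets)
    fields = line.chomp.split(',')
    # process fields
  end
end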

The solution I ended up going with was to download the file to disk and then use Ruby's open3 library and the Linux unzip package to stream the uncompressed csv file out of the zip.

require 'open3'

IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
  while (line = io.gets)
    # do stuff to process the CSV line
  end
end

The -p switch on unzip sends the extracted file to stdout. IO.popen then uses a pipe to turn that into an IO object in Ruby. It works pretty nicely. You could use it with CSV too if you wanted that extra processing; it was just too slow for me.

require 'open3'
require 'csv'

IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
  CSV.foreach(io) do |row|
    # process the row
  end
end
