Stream and unzip large csv file with ruby
Question
I have a problem where I need to download, unzip, and then process line by line a very large CSV file. I think it's useful to give you an idea how large the file is:
- big_file.zip ~700mb
- big_file.csv ~23gb
Here's what I'd like to happen:
- not have to download the whole file before unzipping
- not have to unzip the whole file before parsing csv lines
- not use too much memory/disk while doing all this
I don't know if that's possible or not. Here's what I was thinking:
    require 'open-uri'
    require 'zip'   # the rubyzip gem is required as 'zip', not 'rubyzip'
    require 'csv'

    open('http://foo.bar/big_file.zip') do |zipped|
      Zip::InputStream.open(zipped) do |unzipped|
        sleep 10 until entry = unzipped.get_next_entry && entry.name == 'big_file.csv'
        CSV.foreach(unzipped) do |row|
          # process the row, maybe write out to STDOUT or some file
        end
      end
    end
Here are the problems I know about:
- open-uri reads the whole response and saves it into a Tempfile, which is no good with a file this size. I'd probably need to use Net::HTTP directly, but I'm not sure how to do that and still get an IO.
- I don't know how fast the download is going to be or if Zip::InputStream works the way I've shown it working. Can it unzip some of the file when it's not all there yet?
- Will CSV.foreach work with rubyzip's InputStream? Does it behave enough like File that it will be able to parse out the rows? Will it freak out if it wants to read but the buffer is empty?
I don't know if any of this is the right approach. Maybe some EventMachine solution would be better (although I've never used EventMachine before, but if it works better for something like this, I'm all for it).
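For the record, the first bullet — using Net::HTTP directly without a Tempfile — is possible on its own: Net::HTTP#request_get with a block yields the response before the body has been read, and read_body then hands the body over chunk by chunk. A minimal sketch (the method name is mine, for illustration only):

```ruby
require 'net/http'
require 'uri'

# Stream an HTTP response body chunk by chunk, without a Tempfile.
# Each chunk is yielded to the caller's block as it comes off the socket.
def stream_download(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request_get(uri.request_uri) do |response|
      response.read_body { |chunk| yield chunk }
    end
  end
end
```

This only solves the download half; whether those chunks can then be fed into Zip::InputStream is exactly the open question in the second bullet.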
Answer
It's been a while since I posted this question and in case anyone else comes across it I thought it might be worth sharing what I found.
- For the number of rows I was dealing with, Ruby's standard library CSV was too slow. My csv file was simple enough that I didn't need all that stuff to deal with quoted strings or type coercion anyway. It was much easier just to use IO#gets and then split the line on commas.
- I was unable to stream the entire thing from http to a Zip::InputStream to some IO containing the csv data. This is because the zip file structure has the End of Central Directory (EOCD) at the end of the file. That is needed in order to extract the file, so streaming it from http doesn't seem like it would work.
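The gets-and-split approach from the first point can be sketched like this (using a StringIO stand-in for the real stream; the column names are made up for illustration):

```ruby
require 'stringio'

# Stand-in for the unzipped csv stream; in reality this would be the IO
# coming out of unzip. Splitting on commas only works because the fields
# contain no quoted commas -- the simplifying assumption described above.
io = StringIO.new("id,name,score\n1,alice,90\n2,bob,85\n")

header = io.gets.chomp.split(',')
rows = []
while (line = io.gets)
  rows << line.chomp.split(',')
end
```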
The solution I ended up going with was to download the file to disk and then use Ruby's open3 library and the Linux unzip
package to stream the uncompressed csv file from the zip.
    require 'open3'

    IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
      while (line = io.gets)
        # do stuff to process the CSV line
      end
    end
The -p switch on unzip sends the extracted file to stdout. IO.popen then uses a pipe to make that an IO object in Ruby. Works pretty nicely. You could use it with CSV too if you wanted that extra processing; it was just too slow for me.
    require 'open3'
    require 'csv'

    IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
      # CSV.foreach expects a filename, so wrap the IO with CSV.new instead
      CSV.new(io).each do |row|
        # process the row
      end
    end
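One caveat worth adding to this approach (my note, not part of the original answer): if unzip fails — wrong zip path, missing entry name — the block simply sees an empty stream. IO.popen sets $? after its block returns, so you can detect the failure instead of silently processing nothing. A sketch, with a trivial ruby subprocess standing in for the real unzip command:

```ruby
# After IO.popen's block returns, $? holds the child's exit status.
# ['ruby', '-e', ...] stands in for the real unzip command here; the
# array form also avoids shell interpolation of the arguments.
lines = []
IO.popen(['ruby', '-e', 'puts "a,b"; puts "c,d"'], 'r') do |io|
  io.each_line { |line| lines << line.chomp }
end
unzip_ok = $?.success?
# raise "unzip failed" unless unzip_ok
```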