Count the number of lines in a file without reading entire file into memory?
Question
I'm processing huge data files (millions of lines each).
Before I start processing, I'd like to get a count of the number of lines in the file, so that I can indicate how far along the processing is.
Because of the size of the files, it would not be practical to read an entire file into memory just to count how many lines it has. Does anyone have a good suggestion on how to do this?
Recommended answer
If you are in a Unix environment, you can just let wc -l do the work.
It will not load the whole file into memory. Because wc is optimized for streaming through files and counting words and lines, its performance should be good enough, and you avoid having to stream the file yourself in Ruby.
SSCCE:
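require 'shellwords'

filename = 'a_file/somewhere.txt'
# wc -l prints "<count> <filename>"; keep only the leading count.
line_count = `wc -l #{Shellwords.escape(filename)}`.strip.split(' ')[0].to_i
p line_count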
Or, if you want the combined count for a collection of files passed on the command line:
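require 'shellwords'

# wc prints a "total" line only for two or more files, so read the last
# line: it holds the total, or the single file's count if only one was given.
wc_output = `wc -l #{ARGV.shelljoin}`
line_count = wc_output.lines.last.to_i
p line_count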
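For comparison, when wc is not available (for example, outside a Unix environment), streaming the file yourself in Ruby could look roughly like the sketch below. It is not part of the original answer: the method name count_lines and the chunk_size parameter are hypothetical, chosen only for illustration. It reads the file in fixed-size chunks and counts newline characters, so only one chunk is held in memory at a time.

```ruby
# Minimal sketch: count newline characters without loading the whole file.
# count_lines and chunk_size are illustrative names, not from the answer.
def count_lines(path, chunk_size = 1 << 20)
  count = 0
  File.open(path, 'rb') do |file|
    buffer = String.new
    # IO#read(length, outbuf) refills the same buffer on each call and
    # returns nil at end of file, which ends the loop.
    while file.read(chunk_size, buffer)
      count += buffer.count("\n")
    end
  end
  count
end
```

Calling count_lines('a_file/somewhere.txt') returns an Integer. Like wc -l, this counts newline characters, so a final line with no trailing newline is not included. File.foreach(filename).count is a shorter alternative that also reads line by line rather than all at once, and it does count such a final line.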