Ruby String operations on HUGE String


Question

I have a string that is ~10 GB in size (huge RAM usage, of course). The thing is, I need to perform string operations like gsub and split on it. I noticed that Ruby will just "stop working" at some point (without raising any errors, though).

Example:

str = HUGE_STRING_10_GB

# I will try to split the string using .split:
str.split("\r\n")
# but Ruby will instead just return an array with 
# the full unsplit string itself...

# let's break this down:
# each of those attempts doesn't cause problems and 
# returns arrays with thousands or even millions of items (lines)
str[0..999].split("\r\n")
str[0..999_999].split("\r\n")
str[0..999_999_999].split("\r\n")

# starting from here, problems will occur
str[0..1_999_999_999].split("\r\n")

I'm using Ruby MRI 1.8.7. What is wrong here? Why is Ruby not able to perform string operations on huge strings? And what is a solution?

The only solution I came up with is to "loop" through the string using [0..9], [10..19], ... and to perform the string operations part by part. However, this seems unreliable: for example, what if my split delimiter is very long and falls between two "parts"?
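
For reference, here is a minimal sketch of how that chunked approach can be made safe: read the file in fixed-size chunks and carry the incomplete trailing record over to the next read, so a delimiter split across two chunks is never lost. The file name and chunk size below are placeholders.

# Sketch: stream a huge file in fixed-size chunks, keeping the
# incomplete tail of each chunk and prepending it to the next read,
# so a delimiter that straddles a chunk boundary is never missed.
DELIMITER  = "\r\n"
CHUNK_SIZE = 1024 * 1024  # 1 MB per read; tune as needed

def each_record(filename)
  buffer = ''
  File.open(filename, 'rb') do |f|
    while chunk = f.read(CHUNK_SIZE)
      buffer << chunk
      records = buffer.split(DELIMITER, -1)
      buffer = records.pop  # the last piece may be incomplete; keep it
      records.each { |record| yield record }
    end
  end
  yield buffer unless buffer.empty?  # the leftover is the final record
end

each_record('huge_download.txt') { |record| puts record }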

Another solution that actually works fine is to iterate the string with something like str.each_line {..}. However, this only covers newline delimiters.
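
As a side note, each_line is not strictly limited to "\n": it accepts a record-separator argument (as does IO.foreach), so other delimiters can be streamed the same way. A minimal sketch:

# Sketch: each_line takes an optional record separator, so records
# delimited by something other than "\n" can still be streamed.
str = "foo\r\nbar\r\nbaz"
str.each_line("\r\n") do |record|
  record.chomp!("\r\n")  # strip the delimiter, if present
  puts record
end
# File.foreach accepts the same optional separator:
# File.foreach('huge_download.txt', "\r\n") { |record| ... }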

Thanks for all those answers. In my case, the "HUGE 10 GB STRING" is actually a download from the internet. It contains data that is delimited by a specific sequence (in most cases a simple newline). In my scenario I compare EACH ELEMENT of the 10 GB file to another (smaller) data set that I already have in my script. I appreciate all suggestions.
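
Given that use case, a sketch of the streaming comparison: read the download line by line with File.foreach and test each record against the smaller data set held in a Set, so the full 10 GB never has to sit in memory at once. The file name and set contents are placeholders.

require 'set'

# Sketch: stream the big download record by record and compare each
# one against a smaller in-memory data set. Only one record is held
# in RAM at a time, so the file size doesn't matter.
known = Set.new(%w[alpha beta gamma])  # the smaller data set (placeholder)

matches = 0
File.foreach('huge_download.txt') do |line|
  matches += 1 if known.include?(line.chomp)
end
puts "#{matches} records matched"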

Answer

Here's a benchmark against a real-life log file. Of the methods used to read the file, only the one using foreach is scalable, because it avoids slurping the file.

Using lazy adds overhead, resulting in slower times than map alone.

Notice that foreach is right in there as far as processing speed goes, and it results in a scalable solution. Ruby won't care whether the file is a zillion lines or a zillion TB; it still only sees a single line at a time. See "Why is "slurping" a file not a good practice?" for some related information about reading files.

People often gravitate toward something that pulls in an entire file at once and then splits it into parts. That ignores the work Ruby then has to do to rebuild the array based on line ends using split or something similar. That adds up, and is why I think foreach pulls ahead.

Also notice that the results shift a little between the two benchmark runs. This is probably due to system tasks running on my Mac Pro while the jobs were running. The important thing is that the difference is a wash, confirming to me that foreach is the right way to process big files: it won't kill the machine if the input file exceeds available memory.

require 'benchmark'

REGEX = /\bfoo\z/
LOG = 'debug.log'
N = 1

# readlines: "Reads the entire file specified by name as individual lines,
# and returns those lines in an array."
#
# Because the whole file is read into an array of lines first, this isn't
# scalable. It will work if Ruby has enough memory to hold the array plus all
# other variables and its overhead.
def lazy_map(filename)
  File.open("lazy_map.out", 'w') do |fo|
    fo.puts File.readlines(filename).lazy.map { |li|
      li.gsub(REGEX, 'bar')
    }.force
  end
end

# readlines: "Reads the entire file specified by name as individual lines,
# and returns those lines in an array."
#
# Because the whole file is read into an array of lines first, this isn't
# scalable. It will work if Ruby has enough memory to hold the array plus all
# other variables and its overhead.
def map(filename)
  File.open("map.out", 'w') do |fo|
    fo.puts File.readlines(filename).map { |li|
      li.gsub(REGEX, 'bar')
    }
  end
end

# "Reads the entire file specified by name as individual lines, and returns
# those lines in an array."
# 
# As a result of returning all the lines in an array this isn't scalable. It
# will work if Ruby has enough memory to hold the array plus all other
# variables and its overhead.
def readlines(filename)
  File.open("readlines.out", 'w') do |fo|
    File.readlines(filename).each do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

# This is completely scalable because no file slurping is involved.
# "Executes the block for every line in the named I/O port..."
#
# It's slower, but it works reliably.
def foreach(filename)
  File.open("foreach.out", 'w') do |fo|
    File.foreach(filename) do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

puts "Ruby version: #{ RUBY_VERSION }"
puts "log bytes: #{ File.size(LOG) }"
puts "log lines: #{ `wc -l #{ LOG }`.to_i }"

2.times do
  Benchmark.bm(13) do |b|
    b.report('lazy_map')  { lazy_map(LOG)  }
    b.report('map')       { map(LOG)       }
    b.report('readlines') { readlines(LOG) }
    b.report('foreach')   { foreach(LOG)   }
  end
end

%w[lazy_map map readlines foreach].each do |s|
  puts `wc #{ s }.out`
end

Which results in:

Ruby version: 2.0.0
log bytes: 733978797
log lines: 5540058
                    user     system      total        real
lazy_map       35.010000   4.120000  39.130000 ( 43.688429)
map            29.510000   7.440000  36.950000 ( 43.544893)
readlines      28.750000   9.860000  38.610000 ( 43.578684)
foreach        25.380000   4.120000  29.500000 ( 35.414149)
                    user     system      total        real
lazy_map       32.350000   9.000000  41.350000 ( 51.567903)
map            24.740000   3.410000  28.150000 ( 32.540841)
readlines      24.490000   7.330000  31.820000 ( 37.873325)
foreach        26.460000   2.540000  29.000000 ( 33.599926)
5540058 83892946 733978797 lazy_map.out
5540058 83892946 733978797 map.out
5540058 83892946 733978797 readlines.out
5540058 83892946 733978797 foreach.out

The use of gsub is innocuous since every method uses it, but it's not needed; it was added for a bit of frivolous resistive loading.
