快速查看另一字符串中是否包含大量字符串的方法 [英] Faster way to see if a huge list of strings is contained within another string
问题描述
我在一个数组中存储了大约30万个常用单词的列表.因此,数组的1个元素= 1个字.
I have a list of around 300k common words stored in an array. So, 1 element of the array = 1 word.
另一方面,我有大量的字符串列表,其中可能包含这300k单词中的一个或多个.示例字符串为:ifdxawesome453
.
On the other side, I have a huge list of strings that MAY contain one or more of these 300k words inside them. A sample string would be: ifdxawesome453
.
现在,我需要对照常用单词检查这些长字符串中的每一个.如果在该字符串中找到一个单词,请立即返回.因此,我需要再次检查30万个单词ifdxawesome453
,看看其中是否包含任何单词.
Now, I need to check each of these long strings against the common words. If a word is found within that string, immediately return. So, I need to check the 300k words again ifdxawesome453
and see if any of them is contained within it.
所以我要做的是:
huge_list_of_words.any? do |word|
random_long_word.include?(word)
end
虽然对于随机长单词的小样本来说这是可以的,但如果我有数百万个单词,突然之间需要数小时才能完成工作.
While this is okay for small samples of random long words, if I have millions of them, suddenly it takes hours to finish the job.
有没有办法更快地做到这一点?我想到的唯一方法是,从300k中抽出10k个最常用的单词,然后与之进行比较;如果找不到匹配项,则与完整列表进行比较.
Is there a way to do this faster? The only way I thought of is if I sample out, say 10k most common words out of these 300k and compare against it first, and if no match is found, compare against the full list.
另一种大幅度加快速度的方法是按大小对30万个单词的数组进行分组.然后,将长随机单词与之比较时,我首先检查单词的大小并过滤掉更长的单词.然后,我剩下的索引大小相等或更少,然后从最小大小的单词开始对它们进行搜索.
Another way that drastically sped things up was to group the array of the 300k words by size. When I then compare the long random word against it, I first check if the word's size and filter out any longer words. I am then left with the indexes of the equal size or less words, and search against them, starting from the word with the smallest size.
推荐答案
在此特定用例中,Trie gem和Triez gem之间的基准:
A benchmark between the Trie gem and Triez gem in this particular use case:
word count: 228982
user system total real
trie 13.410000 0.050000 13.460000 ( 13.463473)
triez 11.080000 0.010000 11.090000 ( 11.102195)
trie tail 39.920000 0.140000 40.060000 ( 40.102285)
triez tail 28.960000 0.030000 28.990000 ( 29.022630)
通常,Triez在Op的用例中更快.
Generally, Triez is faster for the Op's use case.
require 'triez'
require 'trie'
require 'benchmark'
DICT = '/usr/share/dict/web2'
triez = Triez.new value_type: :object, default: nil
trie = Trie.new
count = 0
File.foreach(DICT) do |word|
word.chomp!
if word.size > 4
triez[word] = word
trie.add word
count += 1
end
end
puts "word count: #{count}"
def in_trie?(str, trie)
0.upto(str.length - 1) do |i|
node = trie.root
i.upto(str.length - 1) do |j|
break unless node.walk! str[j]
if node.terminal?
return str[i..j]
end
end
end
nil
end
def in_triez?(str, triez)
triez.change_all(:substring, str) do |v|
return v if v
end
nil
end
Benchmark.bm(12) do |b|
b.report('trie') do
1_000_000.times { in_trie?('ifdxawesome45someword3', trie) }
end
b.report('triez') do
1_000_000.times { in_triez?('ifdxawesome45someword3', triez) }
end
b.report('trie tail') do
1_000_000.times { in_trie?('ifdx45someword3awesome', trie) }
end
b.report('triez tail') do
1_000_000.times { in_triez?('ifdx45someword3awesome', triez) }
end
end
rambling-trie 的更新基准,其中c
prefix是压缩版本. (注意:ROUND已减少到100K,而不是前缀基准中的1M)
UPDATE benchmark for rambling-trie, where the lines with c
prefix is the compressed version. (NOTE: the ROUND has been reduced to 100K rather than 1M in the prefix benchmark)
Word count: 228982, ROUND: 100000
user system total real
trie 1.510000 0.000000 1.510000 ( 1.511772)
triez 1.170000 0.000000 1.170000 ( 1.176075)
rambling 4.800000 0.010000 4.810000 ( 4.847021)
c rambling 25.060000 0.050000 25.110000 ( 25.172771)
trie tail 4.540000 0.010000 4.550000 ( 4.566233)
triez tail 3.080000 0.010000 3.090000 ( 3.092655)
rambling tail 4.780000 0.010000 4.790000 ( 4.803114)
c rambling tail 23.470000 0.020000 23.490000 ( 23.525066)
似乎rambling-trie纯粹是在Ruby中实现的,它不提供直接方法来进行前缀匹配.首先需要添加以下猴子补丁.可能会有更好的实现,但我没有做进一步的探讨.
It seems rambling-trie is implemented purely in Ruby, and it doesn't offer direct methods to do prefix matching. The following monkey patches need to be added first. There may be better implementation, but I didn't dig further.
class Rambling::Trie::Container
def match_prefix?(str)
root.match_prefix?(str.chars)
end
end
class Rambling::Trie::RawNode
def match_prefix?(chars, i = 0)
if children_tree.empty?
true
elsif i >= chars.size
false
else
letter = chars[i].to_sym
child = children_tree[letter]
!!child && child.match_prefix?(chars, i + 1)
end
end
end
class Rambling::Trie::CompressedNode
def match_prefix?(chars)
if children_tree.empty?
true
if chars.empty?
false
else
!!(recursive_get :match_prefix?, chars)
end
end
end
def in_r_trie?(str, r_trie)
0.upto(str.length - 1) do |i|
if r_trie.match_prefix? str[i..-1]
return true
end
end
false
end
这篇关于快速查看另一字符串中是否包含大量字符串的方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!