快速查看另一字符串中是否包含大量字符串的方法 [英] Faster way to see if a huge list of strings is contained within another string

查看:119
本文介绍了快速查看另一字符串中是否包含大量字符串的方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在一个数组中存储了大约30万个常用单词的列表.因此,数组的1个元素= 1个字.

I have a list of around 300k common words stored in an array. So, 1 element of the array = 1 word.

另一方面,我有大量的字符串列表,其中可能包含这300k单词中的一个或多个.示例字符串为:ifdxawesome453.

On the other side, I have a huge list of strings that MAY contain one or more of these 300k words inside them. A sample string would be: ifdxawesome453.

现在,我需要对照常用单词检查这些长字符串中的每一个.如果在该字符串中找到一个单词,请立即返回.因此,我需要再次检查30万个单词ifdxawesome453,看看其中是否包含任何单词.

Now, I need to check each of these long strings against the common words. If a word is found within that string, immediately return. So, I need to check the 300k words again ifdxawesome453 and see if any of them is contained within it.

所以我要做的是:

huge_list_of_words.any? do |word|
  random_long_word.include?(word)
end

虽然对于随机长单词的小样本来说这是可以的,但如果我有数百万个单词,突然之间需要数小时才能完成工作.

While this is okay for small samples of random long words, if I have millions of them, suddenly it takes hours to finish the job.

有没有办法更快地做到这一点?我想到的唯一方法是,从300k中抽出10k个最常用的单词,然后与之进行比较;如果找不到匹配项,则与完整列表进行比较.

Is there a way to do this faster? The only way I thought of is if I sample out, say 10k most common words out of these 300k and compare against it first, and if no match is found, compare against the full list.

另一种大幅度加快速度的方法是按大小对30万个单词的数组进行分组.然后,将长随机单词与之比较时,我首先检查单词的大小并过滤掉更长的单词.然后,我剩下的索引大小相等或更少,然后从最小大小的单词开始对它们进行搜索.

Another way that drastically sped things up was to group the array of the 300k words by size. When I then compare the long random word against it, I first check if the word's size and filter out any longer words. I am then left with the indexes of the equal size or less words, and search against them, starting from the word with the smallest size.

推荐答案

在此特定用例中,Trie gem和Triez gem之间的基准:

A benchmark between the Trie gem and Triez gem in this particular use case:

word count: 228982
                   user     system      total        real
trie          13.410000   0.050000  13.460000 ( 13.463473)
triez         11.080000   0.010000  11.090000 ( 11.102195)
trie tail     39.920000   0.140000  40.060000 ( 40.102285)
triez tail    28.960000   0.030000  28.990000 ( 29.022630)

通常,Triez在Op的用例中更快.

Generally, Triez is faster for the Op's use case.

require 'triez'
require 'trie'
require 'benchmark'

DICT = '/usr/share/dict/web2'

triez = Triez.new value_type: :object, default: nil
trie = Trie.new

count = 0
File.foreach(DICT) do |word|
  word.chomp!
  if word.size > 4
    triez[word] = word
    trie.add word
    count += 1
  end
end

puts "word count: #{count}"

def in_trie?(str, trie)
  0.upto(str.length - 1) do |i|
    node = trie.root
    i.upto(str.length - 1) do |j|
      break unless node.walk! str[j]
      if node.terminal?
        return str[i..j]
      end
    end
  end
  nil
end

def in_triez?(str, triez)
  triez.change_all(:substring, str) do |v|
    return v if v
  end
  nil
end

Benchmark.bm(12) do |b|
  b.report('trie') do
    1_000_000.times { in_trie?('ifdxawesome45someword3', trie) }
  end
  b.report('triez') do
    1_000_000.times { in_triez?('ifdxawesome45someword3', triez) }
  end
  b.report('trie tail') do
    1_000_000.times { in_trie?('ifdx45someword3awesome', trie) }
  end
  b.report('triez tail') do
    1_000_000.times { in_triez?('ifdx45someword3awesome', triez) }
  end
end

rambling-trie 的更新基准,其中c prefix是压缩版本. (注意:ROUND已减少到100K,而不是前缀基准中的1M)

UPDATE benchmark for rambling-trie, where the lines with c prefix is the compressed version. (NOTE: the ROUND has been reduced to 100K rather than 1M in the prefix benchmark)

Word count: 228982, ROUND: 100000
                      user     system      total        real
trie              1.510000   0.000000   1.510000 (  1.511772)
triez             1.170000   0.000000   1.170000 (  1.176075)
rambling          4.800000   0.010000   4.810000 (  4.847021)
c rambling       25.060000   0.050000  25.110000 ( 25.172771)
trie tail         4.540000   0.010000   4.550000 (  4.566233)
triez tail        3.080000   0.010000   3.090000 (  3.092655)
rambling tail     4.780000   0.010000   4.790000 (  4.803114)
c rambling tail  23.470000   0.020000  23.490000 ( 23.525066)

似乎rambling-trie纯粹是在Ruby中实现的,它不提供直接方法来进行前缀匹配.首先需要添加以下猴子补丁.可能会有更好的实现,但我没有做进一步的探讨.

It seems rambling-trie is implemented purely in Ruby, and it doesn't offer direct methods to do prefix matching. The following monkey patches need to be added first. There may be better implementation, but I didn't dig further.

class Rambling::Trie::Container
  def match_prefix?(str)
    root.match_prefix?(str.chars)
  end
end

class Rambling::Trie::RawNode
  def match_prefix?(chars, i = 0)
    if children_tree.empty?
      true
    elsif i >= chars.size
      false
    else
      letter = chars[i].to_sym
      child = children_tree[letter]
      !!child && child.match_prefix?(chars, i + 1)
    end
  end
end

class Rambling::Trie::CompressedNode
  def match_prefix?(chars)
    if children_tree.empty?
      true
    if chars.empty?
      false
    else
      !!(recursive_get :match_prefix?, chars)
    end
  end
end

def in_r_trie?(str, r_trie)
  0.upto(str.length - 1) do |i|
    if r_trie.match_prefix? str[i..-1]
      return true
    end
  end
  false
end

这篇关于快速查看另一字符串中是否包含大量字符串的方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆