How can I do fuzzy substring matching in Ruby?


Problem description

I found lots of links about fuzzy matching, comparing one string to another and seeing which gets the highest similarity score.

I have one very long string, which is a document, and a substring. The substring came from the original document, but has been converted several times, so weird artifacts might have been introduced, such as a space here, a dash there. The substring will match a section of the text in the original document 99% or more. I am not matching to see from which document this string is, I am trying to find the index in the document where the string starts.

If the substring were identical to the original text, with no random errors introduced, I would use document.index(substring); however, this fails if there is even one character of difference.

I thought the difference could be accounted for by removing all characters except a-z from both the document and the substring, comparing the results, and then using an index map built while compressing the document to translate the position in the compressed string back to the position in the real document. This worked well where the difference was whitespace and punctuation, but it failed as soon as a single letter differed.
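For reference, the strip-and-map approach described above might look roughly like the following. This is only a sketch: the helper names `compress` and `normalized_index` are made up for illustration, and lowercasing before the comparison is an assumption the question does not state.

```ruby
# Keep only the letters a-z (after lowercasing, which is an assumption) and
# remember the original offset of every character that is kept.
def compress(text)
  chars = []
  positions = []
  text.each_char.with_index do |ch, i|
    next unless ch =~ /[a-z]/i
    chars << ch.downcase
    positions << i
  end
  [chars.join, positions]
end

# Returns the offset in `document` where `substring` starts, ignoring
# whitespace, punctuation and case. Returns nil when the letters themselves
# differ, which is exactly the limitation described above.
def normalized_index(document, substring)
  doc_compact, doc_positions = compress(document)
  sub_compact, = compress(substring)
  return nil if sub_compact.empty?

  hit = doc_compact.index(sub_compact)
  hit && doc_positions[hit]
end

document = "The quick, brown fox jumps over the lazy dog."
p normalized_index(document, "brown-fox  jumps") # => 11
p normalized_index(document, "brown fax jumps")  # => nil (one letter differs)
```

The second call shows why this is not enough on its own: a single changed letter makes the exact lookup on the compressed strings fail.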

The document is typically a few pages to a hundred pages, and the substring from a few sentences to a few pages.

Recommended answer

You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: http://flori.github.com/amatch/.

Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:

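require 'amatch'

module FuzzyFinder
  # Lets the methods be called as FuzzyFinder.find(text, doc).
  module_function

  # Returns [start_offset, word] pairs for every word in input,
  # or yields each pair when a block is given.
  def scanner(input)
    out = [] unless block_given?
    input.scan(/(\w+)(\W*)/) do |word, _white|
      # Offset of this word in the original input.
      startpos = Regexp.last_match.begin(1)
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
    out
  end

  # Returns [best_score, offsets]: the lowest score found and the document
  # offsets of every word at which that score was reached.
  def find(text, doc)
    index = scanner(doc)
    sstr = text.gsub(/\W/, '')
    levenshtein = Amatch::Levenshtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen * 2
    index.each_with_index do |(spos, word), i|
      str = word
      # Append the following words until the candidate is as long as the pattern.
      while str.length < minlen
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0, minlen) if str.length > minlen
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end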

Obviously there are numerous improvements possible and probably necessary! A few off the top:

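  1. Process the document once and store the results, possibly in a database.
  2. Determine a usable length of string for an initial check, and process against that initial substring first before trying to match the entire fragment.
  3. Following up on the previous point, precalculate starting fragments of that length.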
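
To make the first two improvements concrete, here is one way they could be sketched without the gem. Everything in it is an assumption made for illustration rather than part of the answer above: the names `edit_distance`, `build_index` and `prefiltered_find`, the `seed_len` and `seed_tolerance` defaults, and the use of a plain dynamic-programming Levenshtein distance instead of amatch.

```ruby
# Plain dynamic-programming Levenshtein distance, keeping two rows at a time.
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Improvement 1: process the document once. Returns the document with all
# non-word characters removed, plus one [compact_offset, document_offset]
# pair per word so a hit can be mapped back to the real document.
def build_index(doc)
  compact = +""
  starts = []
  doc.scan(/\w+/) do |word|
    starts << [compact.length, Regexp.last_match.begin(0)]
    compact << word
  end
  [compact, starts]
end

# Improvement 2: compare a short seed first, and only run the full
# comparison at word starts where the seed is already close.
# Returns [score, document_offset] for the best candidate, or nil.
def prefiltered_find(text, compact, starts, seed_len: 12, seed_tolerance: 2)
  needle = text.gsub(/\W/, '')
  seed = needle[0, seed_len]
  best = nil
  starts.each do |cpos, dpos|
    next if edit_distance(seed, compact[cpos, seed.length]) > seed_tolerance

    score = edit_distance(needle, compact[cpos, needle.length])
    best = [score, dpos] if best.nil? || score < best[0]
  end
  best
end

doc = "Call me Ishmael. Some years ago, never mind how long precisely, " \
      "having little or no money in my purse"
compact, starts = build_index(doc)
score, offset = prefiltered_find("some years ago - nevr mind how long", compact, starts)
puts "best score #{score} at document offset #{offset}"
```

The trade-off is that an error inside the seed itself can hide the correct position if `seed_tolerance` is too strict, so both parameters would need tuning against real data. The result of `build_index` is what would be stored between runs.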
