String matching techniques

Question

The following strings are considered equal. How can I match stuff like this?

"Hazard Const. Company"
"hazard construction company"

"PETERSON-CHASE GENERAL ENGINEERING CONSTRUCTION INC"
"peterson-chase general  engineering construction inc"

"TRAFFIC DEVELOPMENT SERVICES "
"traffic development services"

My environment is Ruby, but I'm just wondering about general principles for matching strings. The above examples don't work with a rudimentary a == b comparison because of whitespace issues and abbreviations. I can mitigate casing issues with a case-insensitive regex or by downcasing the strings...
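The casing and whitespace differences mentioned above can be removed before any comparison takes place. The following is a minimal sketch, not a complete solution: the helper name `normalize` is hypothetical, and it does nothing about abbreviations such as "Const.", which still need a fuzzy comparison.

```ruby
# Hypothetical helper: lowercases, trims the ends, and collapses
# runs of whitespace into a single space.
def normalize(name)
  name.downcase.strip.gsub(/\s+/, " ")
end

a = normalize("TRAFFIC DEVELOPMENT SERVICES ")
b = normalize("traffic development services")
puts a == b  # the two names now compare equal with plain ==
```

After this step only the abbreviation case remains different, which is the part that needs a distance-based comparison.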

Recommended answer

The following sample compares all of your strings and computes the Levenshtein distance between them (the number of single-character edits, roughly keystrokes, needed to turn one string into the other).
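For readers who want to see what that distance measures, here is a pure-Ruby sketch of the standard dynamic-programming algorithm. It is an illustration only; the method name `levenshtein_distance` is an assumption and is not the API of any gem, and a native gem will be faster on long inputs.

```ruby
# Levenshtein distance via dynamic programming, keeping one row at a time.
def levenshtein_distance(a, b)
  return b.length if a.empty?
  return a.length if b.empty?

  # previous[j] = distance between the processed prefix of a
  # and the first j characters of b
  previous = (0..b.length).to_a

  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution (or match)
      ].min
    end
    previous = current
  end

  previous.last
end

puts levenshtein_distance("kitten", "sitting")  # => 3
puts levenshtein_distance("hazard const. company",
                          "hazard construction company")  # => 7
```

The second call shows why an abbreviation is expensive: "." has to become "ruction", which costs one substitution and six insertions.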

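Based on a defined maximum distance, with a compensation for the length of the string, it then puts the strings in a hash as keys, with the number of occurrences as the value. A string that falls within the threshold of an existing key is counted under that key instead of getting its own entry.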
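To make the length compensation concrete, the threshold can be computed on its own. This sketch reuses the constant values from the answer's code; the helper name `threshold` is hypothetical. Note that Ruby's integer division rounds down, and that the answer compares with a strict `<`, so a pair matches only when its distance is below the value printed here.

```ruby
MAX_DISTANCE = 3
COMPENSATION = 5

# Hypothetical helper: longer strings are allowed more edits.
def threshold(string)
  MAX_DISTANCE + string.length / COMPENSATION
end

puts threshold("hazard construction company")   # 27 chars: 3 + 27/5 => 8
puts threshold("traffic development services")  # 28 chars: 3 + 28/5 => 8
puts threshold("peterson-chase general  engineering construction inc")  # 52 chars => 13
```

With these values, the "Hazard Const." pair (distance 7) only just passes, so the two constants may need tuning for other data.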

require 'levenshtein'

MAX_DISTANCE, COMPENSATION = 3, 5

strings = [
    "Hazard Const. Company",
    "hazard construction company",
    "PETERSON-CHASE GENERAL ENGINEERING CONSTRUCTION INC",
    "peterson-chase general  engineering construction inc",
    "TRAFFIC DEVELOPMENT SERVICES ",
    "traffic development services"
]

result = {}
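strings.each do |s|
  s = s.downcase  # compare case-insensitively without mutating the input array
  # Existing keys whose distance to s is below the length-adjusted threshold
  similar = result.keys.select { |key| Levenshtein.distance(key, s) < MAX_DISTANCE + (s.length / COMPENSATION) }
  if similar.any?
    result[similar.first] += 1  # count s under the first matching key
  else
    result[s] = 1               # no close match yet: start a new group
  end
end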

puts result.inspect
# {"hazard const. company"=>2, "peterson-chase general engineering construction inc"=>2, "traffic development services "=>2}
