Approximate string matching


Question

I know this question has been asked many times. I would like a suggestion on which algorithm is suitable for approximate string matching.

The application is specifically for matching company names and nothing else.

The biggest challenge is probably the trailing part of the company name (the legal suffix) and abbreviated names. Examples:
1. companyA pty ltd vs. companyA pty. ltd. vs. companyA
2. WES Engineering vs. W.E.S. Engineering (an extremely rare occurrence)

Do you think Levenshtein edit distance is adequate?

I am using C#.

Regards, Max

Recommended answer

There are various string distance metrics you could use.

I would recommend Jaro-Winkler. Unlike edit distance, where the result of a comparison is a discrete number of edits, JW gives you a score between 0 and 1, with 1 meaning an exact match. It is especially suited to proper names. Also look at this nice tutorial and this SO question.
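
To make the 0-1 score concrete, here is a minimal sketch of Jaro-Winkler, written in Python for brevity (the logic should port directly to C#). It is not taken from either implementation linked in this answer. The function names are illustrative, and the prefix scale of 0.1 and the 4-character prefix cap are the commonly used defaults, assumed here rather than taken from the question.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count disagreements.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

Because the prefix bonus rewards a shared beginning, variants such as "companyA pty ltd" and "companyA pty. ltd." tend to score close to 1. What threshold counts as a match is something you would have to tune on your own data.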

I haven't worked with C#, but here are some implementations of JW I found online:

Impl 1 (they have a .NET version too, if you look at the file list)

Impl 2

If you want to do slightly more sophisticated matching, you can apply some custom normalization to word forms that commonly occur in company names, such as ltd/limited, inc/incorporated, and corp/corporation, to account for case differences, abbreviations, and so on. This way, if you compute

    distance(normalize("foo corp."),
             normalize("FOO CORPORATION"))

you should get a result of 0 rather than 14 (which is what you would get if you computed the Levenshtein edit distance on the raw strings).
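
The normalize-then-compare idea can be sketched as follows, again in Python for brevity. The `CANONICAL` table and the `normalize` and `levenshtein` names are hypothetical and chosen for illustration only; a real table would need to cover whatever suffixes appear in your company data.

```python
import re

# Hypothetical abbreviation table (an assumption for this sketch):
# maps long word forms to one canonical short form.
CANONICAL = {
    "limited": "ltd",
    "incorporated": "inc",
    "corporation": "corp",
    "proprietary": "pty",
}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and map known word forms to a canonical form."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(CANONICAL.get(word, word) for word in words)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


print(levenshtein("foo corp.", "FOO CORPORATION"))                        # 14
print(levenshtein(normalize("foo corp."), normalize("FOO CORPORATION")))  # 0
```

Stripping punctuation also makes "W.E.S. Engineering" and "WES Engineering" normalize to the same string, which covers the second example in the question. A missing suffix, as in "companyA" vs. "companyA pty ltd", is not handled by this table. One option, not shown here, is to drop known legal suffixes entirely before comparing.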
