Approximate string matching
Question
I know this question has been asked many times. I would like a suggestion on which algorithm is suitable for approximate string matching.
The application is specifically for company name matching and nothing else.
The biggest challenge is probably the trailing part of the company name (the legal suffix) and abbreviated names. Examples:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (extremely rare occurrence)
Do you think Levenshtein edit distance is adequate?
I am using C#.
Regards,
Max
Recommended answer
There are various string distance metrics you could use.
I would recommend Jaro-Winkler. Unlike edit distance, where the result of a comparison is a discrete count of edits, Jaro-Winkler (JW) gives you a similarity score between 0 and 1. It is especially well suited to proper names. Also look at this nice tutorial and this SO question.
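To make the 0-1 score concrete, here is a minimal sketch of Jaro-Winkler. It is written in Python purely for illustration (it is not one of the implementations referenced in this answer) and should port to C# almost line by line. The function names and the defaults `prefix_scale=0.1` and `max_prefix=4` are assumptions following the commonly cited formulation; some implementations also apply the prefix bonus only above a similarity threshold, which this sketch omits.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only if they are equal and
    # no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Walk the matched characters of both strings in order and count
    # positions where they disagree; each transposition is counted twice.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because the score is normalized, you can pick a single cut-off (for example, treat anything above some threshold as a candidate match) regardless of how long the company names are. The right threshold depends on your data and has to be tuned.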
I haven't worked with C#, but here are some implementations of JW I found online:
Impl 1 (they also have a .NET version if you look at the file list)
If you want to do slightly more sophisticated matching, you can apply some custom normalization to word forms that commonly occur in company names, such as ltd/limited, inc/incorporated and corp/corporation, and also account for case insensitivity, abbreviations and so on. This way, if you compute
distance(normalize("foo corp."),
         normalize("FOO CORPORATION"))
you should get a result of 0 rather than 14 (which is what you would get if you computed the Levenshtein edit distance on the raw strings).
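The normalization idea can be sketched as follows, again in Python for illustration only. The `SUFFIX_FORMS` table and the `normalize` and `levenshtein` helpers are hypothetical names introduced here, and the inputs "foo corp." and "FOO CORPORATION" are assumed as the example pair. The table covers only a handful of forms and would need extending for real data.

```python
import re

# Hypothetical abbreviation table: maps short forms to one canonical
# long form. Extend it with whatever variants occur in your data.
SUFFIX_FORMS = {
    "corp": "corporation",
    "inc": "incorporated",
    "ltd": "limited",
    "pty": "proprietary",
}


def normalize(name: str) -> str:
    """Lower-case, strip punctuation, and expand known abbreviations."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    return " ".join(SUFFIX_FORMS.get(word, word) for word in cleaned.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    raw = levenshtein("foo corp.", "FOO CORPORATION")
    norm = levenshtein(normalize("foo corp."), normalize("FOO CORPORATION"))
    print(raw, norm)
```

Stripping punctuation also collapses "W.E.S. Engineering" and "WES Engineering" to the same string. It does not make "companyA pty ltd" equal to a bare "companyA"; for that you would either drop known legal suffixes entirely during normalization or fall back on a similarity score.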