Approximate string matching

Views: 151 | Published: 2016/9/8 18:34:21 | Tags: c# string matching approximate


Question

I know this question has been asked many times. I would like a suggestion on which algorithm is suitable for approximate string matching.

The application is specifically for matching company names and nothing else.

The biggest challenge is probably the trailing part of the company name (the legal suffix) and abbreviated names. Examples:
1. companyA pty ltd vs companyA pty. ltd. vs companyA
2. WES Engineering vs W.E.S. Engineering (an extremely rare occurrence)

Do you think Levenshtein edit distance is adequate?

I am using C#.

Regards, Max

Recommended answer

There are various string distance metrics you could use.

I would recommend Jaro-Winkler. Unlike edit distance, where the result of a comparison is a discrete number of edits, JW gives you a similarity score between 0 and 1. It is especially well suited to proper names. Also look at this nice tutorial and this SO question.
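To make the 0-1 score concrete, here is a minimal sketch of the standard Jaro-Winkler formulation. It is written in Python only to keep it short, and a C# port would follow the same structure. The function names, the prefix scale of 0.1, and the prefix cap of 4 are the conventional defaults, assumed here and not taken from either of the linked implementations.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters only count as a match within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk both strings' matched characters in order and count mismatches;
    # half of that count is the number of transpositions.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(jaro_winkler("companyA pty ltd", "companyA pty. ltd."))
    print(jaro_winkler("WES Engineering", "W.E.S. Engineering"))
```

Because the prefix bonus rewards strings that start the same way, variants that differ only in a trailing legal suffix tend to score high, which is one reason the metric is often suggested for names. The threshold at which two names count as "the same" still has to be tuned on real data.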

I haven't worked with C#, but here are some implementations of JW I found online:

Impl 1 (they have a .NET version too if you look at the file list)

Impl 2

If you want to do somewhat more sophisticated matching, you can apply some custom normalization to word forms that commonly occur in company names, such as ltd/limited, inc/incorporated, and corp/corporation, to account for case differences, abbreviations, and so on. This way, if you compute

distance(normalize("foo corp."), normalize("FOO CORPORATION"))

you should get a result of 0 rather than 14 (which is what you would get if you computed the Levenshtein edit distance on the raw strings).
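The normalization idea above can be sketched as follows, again in Python for brevity. The `CANONICAL` table, the `normalize` helper, and the punctuation rule are illustrative assumptions, not part of any library. The two input strings are an assumed pair, chosen because their raw Levenshtein distance is exactly 14.

```python
import re

# Hypothetical abbreviation table; extend it with the forms in your own data.
CANONICAL = {
    "ltd": "limited",
    "inc": "incorporated",
    "corp": "corporation",
}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    # Removing punctuation outright (rather than replacing it with a space)
    # also collapses "W.E.S." into "wes".
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    return " ".join(CANONICAL.get(word, word) for word in cleaned.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


if __name__ == "__main__":
    raw = levenshtein("foo corp.", "FOO CORPORATION")
    cleaned = levenshtein(normalize("foo corp."), normalize("FOO CORPORATION"))
    print(raw, cleaned)  # prints "14 0"
```

The same `normalize` step makes "companyA pty ltd" and "companyA pty. ltd." identical, and likewise "WES Engineering" and "W.E.S. Engineering". Matching the bare "companyA" would additionally require dropping legal suffixes such as "pty" and "limited" altogether, which is a design choice that depends on how distinct the names in your data are.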
