Fuzzy Text Matching C#

Question

I'm writing a desktop UI (.NET WinForms) to help a photographer clean up his image metadata. There is a list of 66k+ phrases. Can anyone suggest a good open source/free .NET component that uses some sort of algorithm to identify potential candidates for consolidation? For example, there may be two or more entries that are actually the same word or phrase and differ only by whitespace, punctuation, or even a slight misspelling. The application will ultimately rely on the user to carry out the consolidation of phrases, but an effective way to find potential candidates automatically would be invaluable.

Recommended answer

Let me introduce you to the Levenshtein distance formula. It is awesome:

http://en.wikipedia.org/wiki/Levenshtein_distance

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
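
To make the definition concrete, here is a minimal C# sketch of the classic dynamic-programming computation of Levenshtein distance. The class and method names (`Levenshtein`, `Distance`) are made up for illustration and are not from any particular library. It compares UTF-16 `char` values, so it does not treat surrogate pairs or combining characters specially.

```csharp
using System;

public static class Levenshtein
{
    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    public static int Distance(string a, string b)
    {
        if (a is null) throw new ArgumentNullException(nameof(a));
        if (b is null) throw new ArgumentNullException(nameof(b));
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        // Two rolling rows are enough; the full matrix is never needed.
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,   // insertion
                             previous[j] + 1),     // deletion
                    previous[j - 1] + cost);       // substitution or match
            }
            // Swap rows so "previous" always holds the last completed row.
            (previous, current) = (current, previous);
        }
        return previous[b.Length];
    }
}
```

For example, `Levenshtein.Distance("kitten", "sitting")` is 3: two substitutions and one insertion.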

Personally, I used this in a healthcare setting, where provider names were checked for duplicates. Using the Levenshtein process, we gave users a confidence rating for each potential match and let them decide whether it was a true duplicate or something unique.
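
One way such a confidence rating could be derived is sketched below. This is an assumption, not the scheme used in the healthcare system described above. It strips punctuation and extra whitespace, then scores each pair as 1 minus the edit distance divided by the longer length. `DuplicateFinder`, `Normalize`, `Similarity`, `FindCandidates`, and the threshold parameter are all hypothetical names, and a compact copy of the distance function is included so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class DuplicateFinder
{
    // Lower-cases, drops punctuation, and collapses runs of whitespace.
    public static string Normalize(string phrase)
    {
        var sb = new StringBuilder(phrase.Length);
        bool pendingSpace = false;
        foreach (char c in phrase)
        {
            if (char.IsLetterOrDigit(c))
            {
                if (pendingSpace && sb.Length > 0) sb.Append(' ');
                sb.Append(char.ToLowerInvariant(c));
                pendingSpace = false;
            }
            else if (char.IsWhiteSpace(c))
            {
                pendingSpace = true;
            }
            // Punctuation and other symbols are skipped.
        }
        return sb.ToString();
    }

    // Score in [0, 1]; 1.0 means the strings are identical.
    public static double Similarity(string a, string b)
    {
        int maxLen = Math.Max(a.Length, b.Length);
        return maxLen == 0 ? 1.0 : 1.0 - (double)Distance(a, b) / maxLen;
    }

    // Naive all-pairs scan over the normalized phrases.
    public static List<(string First, string Second, double Score)> FindCandidates(
        IReadOnlyList<string> phrases, double threshold)
    {
        var normalized = phrases.Select(Normalize).ToArray();
        var results = new List<(string First, string Second, double Score)>();
        for (int i = 0; i < normalized.Length; i++)
        {
            for (int j = i + 1; j < normalized.Length; j++)
            {
                // The distance is at least the length difference, so pairs
                // that cannot reach the threshold are skipped cheaply.
                int maxLen = Math.Max(normalized[i].Length, normalized[j].Length);
                int diff = Math.Abs(normalized[i].Length - normalized[j].Length);
                if (diff > (1.0 - threshold) * maxLen) continue;

                double score = Similarity(normalized[i], normalized[j]);
                if (score >= threshold) results.Add((phrases[i], phrases[j], score));
            }
        }
        return results;
    }

    // Compact two-row Levenshtein distance.
    private static int Distance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var cur = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            cur[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                cur[j] = Math.Min(Math.Min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            (prev, cur) = (cur, prev);
        }
        return prev[b.Length];
    }
}
```

Calling `DuplicateFinder.FindCandidates(phrases, 0.85)` returns the pairs to show the user for confirmation. The 0.85 cutoff is an arbitrary starting point that would need tuning against real data. The all-pairs loop is quadratic, so for a list of 66k+ phrases it may be worth grouping phrases first (for example by length or by first letter) and only comparing within groups.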
