Matching inexact company names in Java
Question
I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.
For instance, my database might contain a company named "A. B. Widgets & Co Ltd.", while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".
Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc.). It's important to avoid false matches.
The number of companies is small enough that I can maintain a map of their names in memory, i.e., I have the option of using Java rather than SQL to find the right name.
How would you do this in Java?
Recommended answer
Although this thread is a bit old, I recently investigated the efficiency of string distance metrics for name matching and came across this library:
https://code.google.com/p/java-similarities/
If you don't want to spend ages implementing string distance algorithms yourself, I recommend trying it as a first step. It already implements about 20 different algorithms (including Levenshtein, Jaro-Winkler, and Monge-Elkan), and its code is structured well enough that you don't need to understand the whole logic in depth; you can start using it within minutes.
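For a sense of what one of these metrics computes, here is a minimal hand-written sketch of Levenshtein distance with a length-normalized similarity score. It is not the library's code or API; the class and method names (`LevenshteinSketch`, `distance`, `similarity`) are made up for illustration.

```java
// Illustrative sketch only: hypothetical names, not taken from any library.
final class LevenshteinSketch {
    private LevenshteinSketch() {}

    /** Number of single-character insertions, deletions, or substitutions turning a into b. */
    static int distance(String a, String b) {
        // Two-row dynamic programming: prev holds row i-1, curr holds row i.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1.0 for identical strings, 0.0 when every position differs. */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) distance(a, b) / max;
    }
}
```

Since false matches are costly here, a score like this would normally be compared against a fairly strict threshold, and the right threshold has to be tuned on real company names.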
(BTW, I'm not the author of the library, so kudos to its creators.)
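Whichever metric you pick, it probably helps to normalize names first so that low-value words such as Co, Ltd, and Inc don't influence the comparison. The sketch below is one possible approach, not something taken from the library: the class name `CompanyNameKey` and the noise-word list are assumptions you would need to adapt to your own data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: hypothetical class name and noise-word list.
final class CompanyNameKey {
    private CompanyNameKey() {}

    // Assumed list of words that carry little identifying information.
    private static final Set<String> NOISE = new HashSet<>(Arrays.asList(
            "co", "company", "ltd", "limited", "inc", "incorporated",
            "plc", "llc", "corp", "corporation", "and"));

    /** Reduces a company name to a compact key built from its significant tokens. */
    static String of(String name) {
        String cleaned = name.toLowerCase(Locale.ROOT)
                .replace("&", " and ")                 // treat "&" like the word "and"
                .replaceAll("[^\\p{L}\\p{N}]+", " ")   // punctuation becomes a separator
                .trim();
        StringBuilder key = new StringBuilder();
        for (String token : cleaned.split(" ")) {
            if (!token.isEmpty() && !NOISE.contains(token)) {
                key.append(token); // join without spaces so "A B" and "AB" agree
            }
        }
        return key.toString();
    }
}
```

With this list, all four spellings from the question ("A. B. Widgets & Co Ltd.", "AB Widgets Limited", "A.B. Widgets and Co", "A B Widgets") reduce to the key `abwidgets`. You could key the in-memory map by this value, try an exact lookup first, and fall back to a string distance metric with a conservative threshold only when the lookup misses.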