Duplicate elimination of similar company names
Problem description
I have a table with company names. There are many duplicates because of human input errors: typos, different views on whether the subdivision should be included, and so on. I want all of these duplicates to be marked as one company, "1c":
+------------------+
| company |
+------------------+
| 1c |
| 1c company |
| 1c game studios |
| 1c wireless |
| 1c-avalon |
| 1c-softclub |
| 1c: maddox games |
| 1c:inoco |
| 1cc games |
+------------------+
I identified Levenshtein distance as a good way to eliminate typos. However, when the subdivision is added, the Levenshtein distance increases dramatically and is no longer a good measure for this. Is this correct?
In general, I have barely any experience in computational linguistics, so I am at a loss as to which methods I should choose.
What algorithms would you recommend for this problem? I want to implement it in Java. Pure SQL would also be okay. Links to sources would be appreciated. Thanks.
This is a difficult problem. A magic search keyword that might help you is "normalization". The word sometimes means very different things ("database normalization" is unrelated, for example), but normalizing your input is effectively what you are trying to do here.
A simple solution is to use Levenshtein distance with token awareness. The Python library FuzzyWuzzy does this, and this blog post introduces how it works with motivating examples. The basic idea is simple enough that you should be able to implement it in Java without much difficulty.
At a high level, the idea is to split the input into tokens on whitespace and maybe punctuation, then sort the tokens and treat them as a set, then use the set intersection size - allowing for fuzzy matching - as a metric.
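As a rough illustration of that idea, here is a minimal Java sketch. It is not FuzzyWuzzy's implementation, only one possible reading of the description above. The names `CompanyNameMatcher`, `tokens`, `similarity` and `maxEdits` are made up for this example, and dividing the number of matched tokens by the size of the smaller token set is an assumption chosen so that "1c" and "1c game studios" score as a full match.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class; all names here are illustrative assumptions.
final class CompanyNameMatcher {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Lowercase, split on anything that is not a letter or digit,
    // and collect the tokens into a sorted set.
    static Set<String> tokens(String name) {
        Set<String> result = new TreeSet<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Fraction of the smaller token set that has a fuzzy match
    // (at most maxEdits edits) in the larger token set.
    static double similarity(String a, String b, int maxEdits) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() || tb.isEmpty()) {
            return 0.0;
        }
        Set<String> small = ta.size() <= tb.size() ? ta : tb;
        Set<String> large = small == ta ? tb : ta;
        int matched = 0;
        for (String s : small) {
            for (String l : large) {
                if (levenshtein(s, l) <= maxEdits) {
                    matched++;
                    break; // count each token of the smaller set at most once
                }
            }
        }
        return (double) matched / small.size();
    }
}
```

With this sketch, `CompanyNameMatcher.similarity("1c", "1c game studios", 0)` is 1.0 even though the plain Levenshtein distance between the two full strings is 13, and `similarity("1c", "1cc games", 1)` is also 1.0 because "1c" and "1cc" are one edit apart. Be careful with `maxEdits` on very short tokens: allowing one edit on a two-character token like "1c" will also match unrelated tokens such as "2c", so you may want to scale the tolerance with token length.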
Some related links:
- Are there any good libraries available for doing normalization of company names? - Open Data Stack Exchange
- NEMO: Extraction and normalization of organization names from PubMed affiliation strings
- Automatic gazetteer enrichment with user-geocoded data - For place names, this basically creates a list of "true" names and then uses fuzzy lookup.
- Normalizing company names with SPARQL and DBpedia - bobdc.blog - Uses Wikipedia redirect information.