Duplicate elimination of similar company names


Problem description

I have a table with company names. There are many duplicates because of human input errors: people disagree about whether the subdivision should be included, there are typos, etc. I want all these duplicates to be marked as one company "1c":

+------------------+
|     company      |
+------------------+
| 1c               |
| 1c company       |
| 1c game studios  |
| 1c wireless      |
| 1c-avalon        |
| 1c-softclub      |
| 1c: maddox games |
| 1c:inoco         |
| 1cc games        |
+------------------+

I identified Levenshtein distance as a good way to eliminate typos. However, when the subdivision is added the Levenshtein distance increases dramatically and is no longer a good algorithm for this. Is this correct?

In general, I have barely any experience in computational linguistics, so I am at a loss as to which methods I should choose.

What algorithms would you recommend for this problem? I want to implement it in Java. Pure SQL would also be okay. Links to sources would be appreciated. Thanks.

Solution

This is a difficult problem. A magic search keyword that might help you is "normalization" - while sometimes it means very different things ("database normalization" is unrelated, for example), you are effectively trying to normalize your input here.
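
To make "normalization" a bit more concrete, here is a minimal Java sketch of one possible canonical form for a company name. The class name `CompanyNameNormalizer` and the specific rules (lowercase, split on anything that is not a letter or digit, de-duplicate and sort the tokens) are my own assumptions for illustration, not an established standard; real data will probably need extra rules such as dropping legal suffixes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper: the name and the rules are illustrative assumptions.
final class CompanyNameNormalizer {
    private CompanyNameNormalizer() {}

    /** Lowercases the name, splits it on non-alphanumeric characters,
     *  then removes duplicate tokens and sorts the rest. */
    static List<String> tokens(String rawName) {
        return Arrays.stream(rawName.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())   // split() can yield a leading empty string
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    /** Joins the sorted tokens with single spaces to get a canonical string. */
    static String normalize(String rawName) {
        return String.join(" ", tokens(rawName));
    }
}
```

With these rules, `normalize("1c: maddox games")` yields `"1c games maddox"` and `normalize("1C-Avalon")` yields `"1c avalon"`, so punctuation and case differences no longer affect any later comparison.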

A simple solution is to use Levenshtein distance with token awareness. The Python library Fuzzy Wuzzy does this, and this blog post introduces how it works with motivating examples. The basic idea is simple enough that you should be able to implement it in Java without much difficulty.

At a high level, the idea is to split the input into tokens on whitespace and maybe punctuation, then sort the tokens and treat them as a set, then use the set intersection size - allowing for fuzzy matching - as a metric.
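
A rough Java sketch of that idea follows. It is not a port of Fuzzy Wuzzy's actual scoring; the class `TokenFuzzyMatcher`, the `maxEdits` threshold, and the choice to divide by the size of the smaller token set (so that an extra subdivision does not lower the score) are all assumptions of mine.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: names, threshold and scoring rule are assumptions.
final class TokenFuzzyMatcher {
    private TokenFuzzyMatcher() {}

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;  // reuse the two rows
        }
        return prev[b.length()];
    }

    /** Splits on whitespace and punctuation; a TreeSet sorts and de-duplicates. */
    static Set<String> tokens(String name) {
        Set<String> result = new TreeSet<>();
        for (String t : name.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) result.add(t);
        }
        return result;
    }

    /** Counts tokens of left that are within maxEdits of some token of right. */
    static int fuzzyIntersectionSize(Set<String> left, Set<String> right, int maxEdits) {
        int count = 0;
        for (String l : left) {
            for (String r : right) {
                if (levenshtein(l, r) <= maxEdits) { count++; break; }
            }
        }
        return count;
    }

    /** Score in [0, 1]: fuzzy intersection size over the smaller set's size. */
    static double similarity(String a, String b, int maxEdits) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() || tb.isEmpty()) return 0.0;
        Set<String> smaller = ta.size() <= tb.size() ? ta : tb;
        Set<String> larger = (smaller == ta) ? tb : ta;
        return (double) fuzzyIntersectionSize(smaller, larger, maxEdits) / smaller.size();
    }
}
```

With `maxEdits = 1`, `similarity("1c", "1c game studios", 1)` and `similarity("1cc games", "1c", 1)` both come out as 1.0, because the single token "1c" finds a close match on the other side. The flip side is that very short tokens match too easily (for example "1c" and "2c" are one edit apart), so you would probably want a length-dependent threshold and some manual review of the resulting groups.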
