Fuzzy matching of product names


Question

I need to automatically match product names (cameras, laptops, TVs, etc.) that come from different sources to a canonical name in the database.

For example, "Canon PowerShot a20IS", "NEW powershot A20 IS from Canon" and "Digital Camera Canon PS A20IS" should all match "Canon PowerShot A20 IS". I've worked with Levenshtein distance plus some added heuristics (removing obvious common words, assigning a higher cost to number changes, etc.), which works to some extent, but unfortunately not well enough.

The main problem is that even a single-letter change in a relevant keyword can make a huge difference, but it's not easy to detect which keywords are relevant. Consider, for example, three product names:
Lenovo T400
Lenovo R400
New Lenovo T-400, Core 2 Duo
The first two are ridiculously similar strings by any standard (OK, Soundex might help to distinguish the T and R in this case, but the names might as well be 400T and 400R). The first and the third are quite far from each other as strings, but they are the same product.
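To make the mismatch concrete, here is a minimal sketch of plain Levenshtein distance applied to the three names above. The function name `levenshtein` and the lowercasing step are choices made for this illustration. The code uses no heuristics, so it shows the raw problem rather than the asker's actual implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


names = ["Lenovo T400", "Lenovo R400", "New Lenovo T-400, Core 2 Duo"]

# Different products, but only one character apart.
d_different = levenshtein(names[0].lower(), names[1].lower())
# Same product, but many characters apart.
d_same = levenshtein(names[0].lower(), names[2].lower())

print(d_different, d_same)
```

The different products end up far closer to each other than the two spellings of the same product, which is the opposite of what a matcher needs.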

Obviously, the matching algorithm cannot be 100% precise. My goal is to automatically match around 80% of the names with high confidence.

Any ideas or references are much appreciated.

Recommended answer

I think this will boil down to distinguishing keywords such as "Lenovo" from chaff such as "New".

I would run some analysis over the database of names to identify keywords. You could use code similar to that used to generate a word cloud, which essentially counts how often each word occurs.
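A word-cloud style analysis can be as simple as counting how many names each token appears in. The sketch below is one possible way to do that. The `catalog` list is made-up sample data built from the names in the question, and `tokenize` is a hypothetical helper, not something the answer specifies.

```python
import re
from collections import Counter


def tokenize(name: str) -> list[str]:
    """Lowercase a name and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", name.lower())


# Hypothetical sample; in practice this would be every name in the database.
catalog = [
    "Canon PowerShot a20IS",
    "NEW powershot A20 IS from Canon",
    "Digital Camera Canon PS A20IS",
    "Lenovo T400",
    "Lenovo R400",
    "New Lenovo T-400, Core 2 Duo",
]

# Document frequency: count each token at most once per name.
counts = Counter(token for name in catalog for token in set(tokenize(name)))

for token, n in counts.most_common(5):
    print(f"{token:10s} {n}")
```

Brand names such as "canon" and "lenovo" rise to the top, but so does "new", which is why the list still needs a manual pass.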

Then I would hand-edit the list to remove anything that is obviously chaff. For example, "New" may turn out to be very common without being a keyword.

Then you will have a list of keywords that can be used to help identify similarities. You would associate each "raw" name with its keywords, and use those keywords when comparing two or more raw names for similarity (literally, the percentage of shared keywords).
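One way to read "percentage of shared keywords" is the size of the intersection of two keyword sets divided by the size of their union (Jaccard similarity). That reading is an assumption, as is everything named in this sketch: the `KEYWORDS` set stands in for the hand-edited list, and the step that removes hyphens inside tokens (so that "T-400" becomes "t400") is an extra normalization the answer does not mention.

```python
import re

# Hypothetical hand-edited keyword list; "new", "core", "duo" were left out as chaff.
KEYWORDS = {"lenovo", "t400", "r400", "canon", "powershot"}


def keywords(raw_name: str) -> set[str]:
    """Return the curated keywords found in a raw product name."""
    text = raw_name.lower()
    # Assumed normalization: join tokens split by a hyphen ("t-400" -> "t400").
    text = re.sub(r"(?<=[a-z0-9])-(?=[a-z0-9])", "", text)
    tokens = re.findall(r"[a-z0-9]+", text)
    return {t for t in tokens if t in KEYWORDS}


def similarity(name_a: str, name_b: str) -> float:
    """Shared keywords as a fraction of all keywords in either name."""
    a, b = keywords(name_a), keywords(name_b)
    if not a and not b:
        return 0.0  # no keywords at all: treat as no evidence of a match
    return len(a & b) / len(a | b)


same = similarity("Lenovo T400", "New Lenovo T-400, Core 2 Duo")
different = similarity("Lenovo T400", "Lenovo R400")
print(same, different)
```

With this keyword list, the two spellings of the T400 share all their keywords, while the T400 and R400 share only the brand. The result depends entirely on model numbers being present in the keyword list and on the normalization catching spelling variants.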

Not a perfect solution by any stretch, but I don't think you are expecting one?
