Best machine learning technique for matching product strings

Question

Here's a puzzle...

I have two databases of the same 50,000+ electronic products, and I want to match products in one database to those in the other. However, the product names are not always identical. I've tried using the Levenshtein distance to measure string similarity, but this hasn't worked. For example,

-LG 42CS560 42-Inch 1080p 60Hz LCD HDTV
-LG 42 Inch 1080p LCD HDTV

These items are the same, yet their product names vary quite a lot.

On the other hand...

-LG 42 Inch 1080p LCD HDTV
-LG 50 Inch 1080p LCD HDTV

These are different products with very similar product names.
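To make the mismatch concrete, here is a minimal sketch that computes the plain Levenshtein distance for the two pairs above. The function and variable names are illustrative only, and this is the textbook dynamic-programming version rather than any particular library's implementation.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

same_a = "LG 42CS560 42-Inch 1080p 60Hz LCD HDTV"
same_b = "LG 42 Inch 1080p LCD HDTV"
other = "LG 50 Inch 1080p LCD HDTV"

# The same product under two names is far apart...
print(levenshtein(same_a, same_b))
# ...while two different products are very close.
print(levenshtein(same_b, other))
```

The pair that should match ends up with a much larger distance than the pair that should not, so no single distance threshold separates the two cases.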

How should I tackle this problem?

Recommended answer

My first thought is to try to parse the names into a description of features (company LG, size 42 Inch, resolution 1080p, type LCD HDTV). Then you can match these descriptions against each other for compatibility; it's okay to omit a product number but bad to have different sizes. A simple check of whether the common attributes are compatible might be enough, or you might have to write or learn rules about how much different attributes are allowed to differ, and so on.
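A rough sketch of that idea, under assumptions of my own: the attribute names, the regexes in PATTERNS, and the brand list are all hypothetical and would need to be adapted to the real data. Two names count as compatible when every attribute they both have agrees; an attribute missing from one side (such as the model number) is simply ignored.

```python
import re

# Hypothetical attribute patterns; a real catalogue would need many more.
PATTERNS = {
    "brand": re.compile(r"\b(LG|Samsung|Sony)\b", re.I),
    "size_inches": re.compile(r"\b(\d{2})[\s-]?inch\b", re.I),
    "resolution": re.compile(r"\b(\d{3,4}p)\b", re.I),
    "model": re.compile(r"\b(\d{2}[A-Z]{2}\d{3})\b"),
}

def parse(name):
    """Extract whichever attributes the patterns can find in a name."""
    attrs = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(name)
        if match:
            attrs[key] = match.group(1).lower()
    return attrs

def compatible(a, b):
    """True if every attribute present in both descriptions is equal."""
    shared = a.keys() & b.keys()
    return all(a[key] == b[key] for key in shared)

full = parse("LG 42CS560 42-Inch 1080p 60Hz LCD HDTV")
short = parse("LG 42 Inch 1080p LCD HDTV")
bigger = parse("LG 50 Inch 1080p LCD HDTV")

print(compatible(full, short))    # model number missing on one side
print(compatible(short, bigger))  # sizes disagree
```

Compatibility alone does not pick a unique match when several candidates are compatible; a follow-up step could prefer the candidate sharing the most attributes.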

Depending on how many different kinds of products you have and how different the listed names are, I might actually start by manually defining a set of attributes, possibly even just adding specific words or regexes to match them, then iteratively checking what hasn't been parsed so far and adding rules for that. I'd imagine there's not a lot of ambiguity in terms of one vocabulary item possibly belonging to multiple attributes, though without seeing your database I can't be sure.
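One way to drive that loop, sketched with a made-up starter rule set (RULES and unparsed_tokens are hypothetical names): strip everything the current rules recognise, then count the leftover tokens across all names. The most frequent leftovers suggest which rule to write next.

```python
import re
from collections import Counter

# Hypothetical starter rules: (attribute label, pattern).
RULES = [
    ("brand", re.compile(r"\bLG\b", re.I)),
    ("size", re.compile(r"\b\d{2}[\s-]?inch\b", re.I)),
    ("resolution", re.compile(r"\b\d{3,4}p\b", re.I)),
]

def unparsed_tokens(names, rules=RULES):
    """Count tokens that no rule has claimed yet, across all names."""
    leftovers = Counter()
    for name in names:
        remainder = name
        for _label, pattern in rules:
            # Blank out whatever this rule recognises.
            remainder = pattern.sub(" ", remainder)
        leftovers.update(token.lower() for token in remainder.split())
    return leftovers

names = [
    "LG 42CS560 42-Inch 1080p 60Hz LCD HDTV",
    "LG 42 Inch 1080p LCD HDTV",
]
leftovers = unparsed_tokens(names)

# Frequent leftovers are candidates for the next rule to add.
for token, count in leftovers.most_common():
    print(token, count)
```

On these two names the leftovers point at a product-type attribute (LCD, HDTV), a refresh rate, and a model number as the next rules to add.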

If that's not going to be feasible, this extraction is somewhat analogous to semi-supervised part-of-speech tagging. It differs, though, in that I imagine the vocabulary is much more limited than in typical parsing, and in that the space of product names is more hierarchical: the resolution tag only applies to certain kinds of products. I'm not very familiar with that literature; there might be some ideas you could use.
