算法基于关键词匹配路口 [英] Algorithms for matching based on keywords intersection

查看:132
本文介绍了算法基于关键词匹配路口的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我们有买家和卖家正试图找到对方的市场。买家可以标记他们的关键字的需求;卖家可以做相同的,他们是卖什么。我很感兴趣,在他们的相关性方面对他们的两个关键词组的基础上,特别是买家寻找算法(县),秩卖家。

Suppose we have buyers and sellers that are trying to find each other in a market. Buyers can tag their needs with keywords; sellers can do the same for what they are selling. I'm interested in finding algorithm(s) that rank-order sellers in terms of their relevance for a particular buyer on the basis of their two keyword sets.

下面是一个例子:

buyer_keywords = {"furry", "four legs", "likes catnip", "has claws"} 

,然后我们有我们需要他们的相关性方面的排名顺序的两个潜在卖家:

and then we have two potential sellers that we need to rank order in terms of their relevance:

seller_keywords[1] = {"furry", "four legs", "arctic circle", "white"}
seller_keywords[2] = {"likes catnip", "furry", 
                      "hates mice", "yarn-lover", "whiskers"}

如果我们只是使用关键字的交集,我们没有得到太多的歧视:两个相交的2个关键字。如果我们把交点计数由并集的大小,卖方2实际执行,因为较大数目的关键字的差。这似乎引入自动处罚任何方法不改正的关键字设置的大小(我们绝对不希望处罚添加关键字)。

If we just use the intersection of keywords, we do not get much discrimination: both intersect on 2 keywords. If we divide the intersection count by the size of the set union, seller 2 actually does worse because of the greater number of keywords. This would seem to introduce an automatic penalty for any method not correcting keyword set size (and we definitely don't want to penalize adding keywords).

要放多一点的结构上的问题,假设我们有关键字的属性(其总结为1每个卖家),如,:强度的一些真实测量

To put a bit more structure on the problem, suppose we have some truthful measure of intensity of keyword attributes (which have to sum to 1 for each seller), e.g.,:

seller_keywords[1] = {"furry":.05, 
                      "four legs":.05, 
                      "arctic circle":.8, 
                      "white":.1}

seller_keywords[2] = {"likes catnip":.5, 
                      "furry":.4, 
                      "hates mice":.02, 
                      "yarn-lover":.02, 
                      "whiskers":.06}

现在,我们可以总结的命中值:所以现在卖家1只得到.1分的高分,而卖方得到2 .9分。到目前为止,一切都很好,但现在我们可能会得到第三个卖家非常有限,非描述的关键字设置:

Now we could sum up the value of hits: so now Seller 1 only gets a score of .1, while Seller 2 gets a score of .9. So far, so good, but now we might get a third seller with a very limited, non-descriptive keyword set:

seller_keywords[3] = {"furry":1}

这弹射他们对他们的唯一关键字,这是不好的任何撞在上面。

This catapults them to the top for any hit on their sole keyword, which isn't good.

反正,我的猜测(和希望)的是,这是一个相当普遍的问题,并且存在不同的解决方案的算法与已知的优点和局限性。这可能是一些覆盖CS101,所以我觉得一个好的答案,这个问题可能只是一个链接到相关的参考资料。

Anyway, my guess (and hope) is that this is a fairly general problem and that there exist different algorithmic solutions with known strengths and limitations. This is probably something covered in CS101, so I think a good answer to this question might just be a link to the relevant references.

推荐答案

我想你想使用余弦相似;它是那么远让你pretty的作为第一劈一个基本技术。直观地说,你创建一个向量,其中每一个你知道的标签都有一个特定的指数:

I think you're looking to use cosine similarity; it's a basic technique that gets you pretty far as a first hack. Intuitively, you create a vector where every tag you know about has a particular index:

terms[0] --> aardvark
terms[1] --> anteater
...
terms[N] --> zuckerberg

然后创建在这个空间里的每个人向量:

Then you create vectors in this space for each person:

person1[0] = 0     # this person doesn't care about aardvarks
person1[1] = 0.05  # this person cares a bit about anteaters
...
person1[N] = 0

每个人现在在此N维空间中的向量。然后,您可以用余弦相似度来计算对它们之间的相似性。 Calculationally,这是基本相同的,要求这两个向量之间的角度的。你想要一个余弦接近1,这意味着该载体是大致共线 - 它们具有对大多数尺寸的相似值

Each person is now a vector in this N-dimensional space. You can then use cosine similarity to calculate similarity between pairs of them. Calculationally, this is basically the same of asking for the angle between the two vectors. You want a cosine close to 1, which means that the vectors are roughly collinear -- that they have similar values for most of the dimensions.

要改善这个指标,你可能要使用 TF-IDF 权在您的载体的元素。 TF-IDF将淡化流行方面的重要性(如,'iPhone')和促进不受欢迎的条款,这个人似乎特别关联的重要性。

To improve this metric, you may want to use tf-idf weighting on the elements in your vector. Tf-idf will downplay the importance of popular terms (e.g, 'iPhone') and promote the importance of unpopular terms that this person seems particularly associated with.

结合TF-IDF加权余弦相似度做的好于大多数应用是这样的。

Combining tf-idf weighting and cosine similarity does well for most applications like this.

这篇关于算法基于关键词匹配路口的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆