Grouping Similar Strings

Question

I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should have similar effectiveness. For example,

Term               Group
NBA Basketball     1
Basketball NBA     1
Basketball         1
Baseball           2

It's a contrived example, but hopefully it explains what I'm trying to do. So then, what is the best way to do what I've described? I thought the nltk may have something along those lines, but I'm only barely familiar with it.

Thanks

Recommended Answer

You'll want to cluster these terms, and for the similarity metric I recommend Dice's coefficient computed over character n-grams. For example, split each string into overlapping two-letter sequences and compare the resulting sets (for "NBA Basketball": "NB", "BA", "A ", " B", "Ba", ...).
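
To make the metric concrete, here is a minimal sketch of Dice's coefficient, 2 * |A & B| / (|A| + |B|), applied to the character bigrams of two of the example terms. The names char_ngrams and dice_coefficient are made up for this illustration, and returning 0.0 for two empty sets is an assumption rather than part of the definition.

```python
def char_ngrams(text, n=2):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice_coefficient(a, b):
    """Dice's coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # assumption: treat two empty sets as dissimilar
    return 2.0 * len(a & b) / (len(a) + len(b))


basketball = char_ngrams("Basketball")  # 9 distinct bigrams
baseball = char_ngrams("Baseball")      # 7 distinct bigrams
shared = basketball & baseball          # 'Ba', 'as', 'ba', 'al', 'll'

print(sorted(shared))
print(dice_coefficient(basketball, baseball))  # 2 * 5 / (9 + 7) = 0.625
```

With bigrams, "Basketball" and "Baseball" still look fairly similar (0.625). Longer grams are stricter: with 4-character grams the two terms share only "ball" and score 2 * 1 / (7 + 5), roughly 0.167, which is why the gram length is worth tuning.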

nltk appears to provide dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement in a way that'll allow tuning. Here's how to compare these strings at the character rather than word level.

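#!/usr/bin/env python3
import operator
import sys

def tokenize(s, glen):
  # Collect the set of all glen-character substrings (character n-grams) of s.
  grams = set()
  for i in range(len(s) - (glen - 1)):
    grams.add(s[i:i + glen])
  return grams

def dice_grams(g1, g2):
  # Dice's coefficient: twice the overlap divided by the combined set sizes.
  if not g1 and not g2:
    return 0.0  # both strings are shorter than the gram length; avoid dividing by zero
  return (2.0 * len(g1 & g2)) / (len(g1) + len(g2))

def dice(n, s1, s2):
  return dice_grams(tokenize(s1, n), tokenize(s2, n))

def main():
  GRAM_LEN = 4  # tunable; 2 gives the two-letter grams described above
  scores = {}
  # Score every unordered pair of command-line arguments.
  for i in range(1, len(sys.argv)):
    for j in range(i + 1, len(sys.argv)):
      s1 = sys.argv[i]
      s2 = sys.argv[j]
      scores[s1 + ":" + s2] = dice(GRAM_LEN, s1, s2)
  # Print the pairs from least to most similar.
  for item in sorted(scores.items(), key=operator.itemgetter(1)):
    print(item)

if __name__ == "__main__":
  main()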

When this program is run with your strings, the following similarity scores are produced:

./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"

('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.6363636363636364)
('NBA Basketball:Basketball', 0.7777777777777778)
('Basketball NBA:Basketball', 0.7777777777777778)

At least for this example, the margin between the basketball and baseball terms should be sufficient for clustering them into separate groups. Alternatively, you may be able to skip a full clustering step and apply a threshold to the similarity scores directly in your code.
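
As a rough sketch of the threshold idea, the snippet below links any two terms whose Dice score reaches a cutoff and reports the resulting connected groups (single-link grouping via a small union-find). The function group_terms and its default threshold of 0.5 are assumptions made for this illustration and are not part of the answer's code; the threshold would need tuning on real search terms.

```python
from itertools import combinations


def char_ngrams(text, n):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(n, s1, s2):
    """Dice's coefficient over the character n-grams of s1 and s2."""
    g1, g2 = char_ngrams(s1, n), char_ngrams(s2, n)
    if not g1 and not g2:
        return 0.0  # assumption: strings too short to tokenize never match
    return 2.0 * len(g1 & g2) / (len(g1) + len(g2))


def group_terms(terms, gram_len=4, threshold=0.5):
    """Map each term to a group number; linked terms share a number."""
    parent = list(range(len(terms)))  # union-find forest over term indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair whose similarity reaches the threshold.
    for i, j in combinations(range(len(terms)), 2):
        if dice(gram_len, terms[i], terms[j]) >= threshold:
            parent[find(j)] = find(i)

    # Number the groups in order of first appearance.
    group_ids = {}
    result = {}
    for i, term in enumerate(terms):
        root = find(i)
        if root not in group_ids:
            group_ids[root] = len(group_ids) + 1
        result[term] = group_ids[root]
    return result


terms = ["NBA Basketball", "Basketball NBA", "Basketball", "Baseball"]
for term, group in group_terms(terms).items():
    print(term, group)
```

On the four example terms this reproduces the grouping from the question: the three basketball terms land in group 1 and "Baseball" in group 2. Single-link grouping can chain loosely related terms together on larger inputs, so a stricter threshold or a proper clustering algorithm may be needed there.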
