Fuzzy Group By, Grouping Similar Words


Problem Description

This question was asked here:

What is a good strategy to group similar words? (http://stackoverflow.com/questions/6579263/what-is-a-good-strategy-to-group-similar-words)


but no clear answer is given on how to "group" items. The solution based on difflib is basically search: for a given item, difflib can return the most similar word out of a list. But how can this be used for grouping?

I want to reduce

['ape', 'appel', 'apple', 'peach', 'puppy']

to either

['ape', 'appel', 'peach', 'puppy']

or

['ape', 'apple', 'peach', 'puppy']


One idea I tried was: for each item, iterate through the list; if get_close_matches returns more than one match, use it, and if not, keep the word as is. This partly worked, but it can suggest apple for appel, then appel for apple; these words would simply switch places and nothing would change.
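The ping-pong effect described above is easy to reproduce with difflib itself (a small sketch of mine, not part of the question's code):

```python
import difflib

words = ['ape', 'appel', 'apple', 'peach', 'puppy']

# Ask difflib for each word's single best match among the *other* words.
for w in words:
    others = [x for x in words if x != w]
    print(w, '->', difflib.get_close_matches(w, others, n=1))
```

Here 'appel' suggests 'apple' and 'apple' suggests 'appel' back, so naively replacing each word by its best match just swaps the pair instead of merging the group.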


I would appreciate any pointers, names of libraries, etc.

Note: also in terms of performance, we have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/C++ based solution?
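Before reaching for C, a pure-Python mitigation is to cap the work per comparison: a two-row Levenshtein that bails out as soon as the distance is guaranteed to exceed a threshold. This is my own sketch (the function name is made up); for genuine C/C++ speed, libraries such as python-Levenshtein or RapidFuzz exist.

```python
def bounded_levenshtein(s1, s2, limit):
    """Levenshtein distance, but returns limit + 1 as soon as the
    distance is guaranteed to exceed `limit`."""
    if abs(len(s1) - len(s2)) > limit:
        return limit + 1          # length gap alone already exceeds the limit
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution / match
        if min(curr) > limit:     # every later row can only stay >= this
            return limit + 1
        prev = curr
    return prev[-1]
```

Since grouping only needs to know whether two words are within LIMIT edits of each other, the exact distance beyond the limit is irrelevant, which is what makes the early exit safe.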

Thanks.


Note: Further investigation revealed k-medoids is the right algorithm (as well as hierarchical clustering), since k-medoids does not require "centers": it uses the data points themselves as centers (these points are called medoids, hence the name). In the word-grouping case, the medoid would be the representative element of that group/cluster.
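To make the k-medoids remark concrete, here is a naive sketch of the assign/update loop over a string distance (my own illustration, not a production PAM implementation; the `dist` choice and first-k initialization are arbitrary):

```python
import difflib

def dist(a, b):
    # Crude string distance: 1 minus difflib's similarity ratio.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def k_medoids(words, k, iters=10):
    medoids = list(words[:k])  # naive init: the first k words
    clusters = {}
    for _ in range(iters):
        # Assignment step: each word joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for w in words:
            clusters[min(medoids, key=lambda m: dist(w, m))].append(w)
        # Update step: within each cluster, the new medoid is the member
        # minimizing the total distance to the other members.
        new_medoids = [min(ms, key=lambda c: sum(dist(c, w) for w in ms))
                       for ms in clusters.values() if ms]
        if set(new_medoids) == set(medoids):
            break  # converged
        medoids = new_medoids
    return clusters
```

Each medoid is always one of the input words, which is exactly why the method suits strings: no "mean of two words" ever needs to be computed.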

Recommended Answer


You need to normalize the groups. In each group, pick one word or coding that represents the group. Then group the words by their representative.

Some possible ways:

  • Pick the first encountered word.
  • Pick the lexicographically first word.
  • Derive a pattern for all the words.
  • Pick a unique index.
  • Use the Soundex code as the pattern.
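As an illustration of the last option, a simplified Soundex can serve as the grouping key. This is my own reduced version (the full standard has extra h/w adjacency rules that are skipped here):

```python
from collections import defaultdict

def simple_soundex(word):
    """Simplified Soundex: first letter plus up to three consonant-class digits."""
    codes = {c: str(d) for d, letters in
             enumerate(['bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r'], 1)
             for c in letters}
    word = word.lower()
    digits = [codes[c] for c in word if c in codes]  # vowels, h, w, y drop out
    out = []
    for d in digits:                       # collapse runs of the same digit
        if not out or out[-1] != d:
            out.append(d)
    if out and word[0] in codes and out[0] == codes[word[0]]:
        out = out[1:]                      # the first letter encodes itself
    return (word[0].upper() + ''.join(out) + '000')[:4]

groups = defaultdict(list)
for w in ['ape', 'appel', 'apple', 'peach', 'puppy']:
    groups[simple_soundex(w)].append(w)
# 'appel' and 'apple' share the key 'A140' and land in the same group
```

A phonetic key like this is cheap (one pass per word, then a dict lookup), but it only merges words that sound alike, so typos that change the consonant skeleton still escape the group.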


Grouping the words could be difficult, though. If A is similar to B, and B is similar to C, A and C are not necessarily similar to each other. If B is the representative, both A and C can be included in the group. But if A or C is the representative, the other cannot be included.
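A concrete instance of that non-transitivity, using difflib's default similarity cutoff (the example words are my own):

```python
from difflib import SequenceMatcher

def similar(a, b, cutoff=0.6):
    # difflib's ratio is 2*matches / total length; 0.6 is its default cutoff.
    return SequenceMatcher(None, a, b).ratio() >= cutoff

# 'cot' is similar to both 'cat' and 'cog', but 'cat' and 'cog'
# share only a single letter and fall below the cutoff.
print(similar('cat', 'cot'), similar('cot', 'cog'), similar('cat', 'cog'))
```

So if 'cot' seeds the group, both 'cat' and 'cog' join it; if 'cat' seeds it, 'cog' does not qualify.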


Going by the first alternative (first encountered word):

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()

    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word, None)
        if seed is not None:
            return seed
        # Reuse an existing seed if the word is within LIMIT edits of it.
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        # Otherwise the word becomes a new seed for its own group.
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        # Levenshtein edit distance via the classic dynamic-programming matrix.
        l1 = len(s1)
        l2 = len(s2)
        matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
        for zz in range(l2):
            for sz in range(l1):
                if s1[sz] == s2[zz]:
                    matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                                 matrix[zz][sz + 1] + 1,
                                                 matrix[zz][sz])
                else:
                    matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                                 matrix[zz][sz + 1] + 1,
                                                 matrix[zz][sz] + 1)
        return matrix[l2][l1]


import itertools

def group_similar(words):
    seeder = Seeder()
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k, v in groups]

Example:

import pprint

pprint.pprint(group_similar([
    'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
    'people', 'into', 'year', 'your', 'good', 'some', 'could',
    'them', 'see', 'other', 'than', 'then', 'now', 'look',
    'only', 'come', 'its', 'over', 'think', 'also', 'back',
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
    'way', 'even', 'new', 'want', 'because', 'any', 'these',
    'give', 'day', 'most', 'us'
]), width=120)

Output:

[['after'],
 ['also'],
 ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
 ['back'],
 ['because'],
 ['but', 'about', 'get', 'just'],
 ['first'],
 ['from'],
 ['good', 'look'],
 ['have', 'make', 'give'],
 ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
 ['into'],
 ['know', 'new'],
 ['like', 'time', 'take'],
 ['most'],
 ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
 ['only'],
 ['over', 'our', 'even'],
 ['people'],
 ['say', 'she', 'way', 'day'],
 ['some', 'see', 'come'],
 ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
 ['think'],
 ['well'],
 ['what', 'who', 'when', 'than'],
 ['with', 'will', 'which'],
 ['work'],
 ['would', 'could'],
 ['year', 'your']]
