模糊分组依据,对相似词进行分组 [英] Fuzzy Group By, Grouping Similar Words

查看:34
本文介绍了模糊分组依据,对相似词进行分组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题之前在这里问过

this question is asked here before

将相似词分组的好策略是什么?

但没有给出关于如何分组"项目的明确答案.基于 difflib 的解决方案基本上是搜索,对于给定的项目,difflib 可以从列表中返回最相似的单词.但这如何用于分组?

but no clear answer is given on how to "group" items. The solution based on difflib is basically search, for given item, difflib can return the most similar word out of a list. But how can this be used for grouping?

我想减少

['ape', 'appel', 'apple', 'peach', 'puppy']

['ape', 'appel', 'peach', 'puppy']

['ape', 'apple', 'peach', 'puppy']

我尝试过的一个想法是,对于每个项目,遍历列表,如果 get_close_matches 返回多个匹配项,则使用它,如果不保持原样.这部分奏效了,但它可以建议 apple for appel,然后 appel for apple,这些词只会交换位置,什么都不会改变.

One idea I tried was, for each item, iterate through the list, if get_close_matches returns more than one match, use it, if not keep the word as is. This partly worked, but it can suggest apple for appel, then appel for apple, these words would simply switch places and nothing would change.

如有任何指针、库名称等,我将不胜感激.

I would appreciate any pointers, names of libraries, etc.

注意:同样在性能方面,我们有一个 300,000 项的列表,get_close_matches 似乎有点慢.有谁知道基于 C/++ 的解决方案?

Note: also in terms of performance, we have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/++ based solution out there?

谢谢,

注意:进一步调查显示kmedoid是正确的算法(以及层次聚类),因为kmedoid不需要中心",它需要/使用数据点本身作为中心(这些点被称为medoids,因此得名).在词组情况下,medoid 将是该组/集群的代表元素.

Note: Further investigation revealed kmedoid is the right algorithm (as well as hierarchical clustering), since kmedoid does not require "centers", it takes / uses data points themselves as centers (these points are called medoids, hence the name). In word grouping case, the medoid would be the representative element of that group / cluster.

推荐答案

您需要对组进行规范化.在每组中,选择一个代表该组的单词或编码.然后按他们的代表将单词分组.

You need to normalize the groups. In each group, pick one word or coding that represents the group. Then group the words by their representative.

一些可能的方法:

  • 选择第一个遇到的词.
  • 选择字典序的第一个单词.
  • 为所有单词推导出一个模式.
  • 选择一个唯一索引.
  • 使用 soundex 作为模式.
  • Pick the first encountered word.
  • Pick the lexicographic first word.
  • Derive a pattern for all the words.
  • Pick an unique index.
  • Use the soundex as pattern.

不过,对单词进行分组可能很困难.如果A与B相似,B与C相似,则A和C不一定相似.如果 B 是代表,则 A 和 C 都可以包含在该组中.但如果A或C是代表,则另一个不能包括在内.

Grouping the words could be difficult, though. If A is similar to B, and B is similar to C, A and C is not necessarily similar to each other. If B is the representative, both A and C could be included in the group. But if A or C is the representative, the other could not be included.

按照第一个选择(第一个遇到的词):

Going by the first alternative (first encountered word):

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()

    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word,None)
        if seed is not None:
            return seed
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

import itertools

def group_similar(words):
    seeder = Seeder()
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k,v in groups]

示例:

import pprint

print pprint.pprint(group_similar([
    'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
    'people', 'into', 'year', 'your', 'good', 'some', 'could',
    'them', 'see', 'other', 'than', 'then', 'now', 'look',
    'only', 'come', 'its', 'over', 'think', 'also', 'back',
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
    'way', 'even', 'new', 'want', 'because', 'any', 'these',
    'give', 'day', 'most', 'us'
]), width=120)

输出:

[['after'],
 ['also'],
 ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
 ['back'],
 ['because'],
 ['but', 'about', 'get', 'just'],
 ['first'],
 ['from'],
 ['good', 'look'],
 ['have', 'make', 'give'],
 ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
 ['into'],
 ['know', 'new'],
 ['like', 'time', 'take'],
 ['most'],
 ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
 ['only'],
 ['over', 'our', 'even'],
 ['people'],
 ['say', 'she', 'way', 'day'],
 ['some', 'see', 'come'],
 ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
 ['think'],
 ['well'],
 ['what', 'who', 'when', 'than'],
 ['with', 'will', 'which'],
 ['work'],
 ['would', 'could'],
 ['year', 'your']]

这篇关于模糊分组依据,对相似词进行分组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆