NLTK Wordnet Synset for word phrase


Question

I'm working with the Python NLTK Wordnet API. I'm trying to find the best synset that represents a group of words.

If I need to find the best synset for something like "school & office supplies", I'm not sure how to go about this. So far I've tried finding the synsets for the individual words and then computing the best lowest common hypernym like this:

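import itertools

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet


def find_best_synset(category_name):
    # get_wordnet_pos is my own helper (not shown here); it maps a
    # part-of-speech tag to a WordNet POS, or a falsy value to skip the word.
    text = word_tokenize(category_name)
    tags = pos_tag(text)

    # One list of candidate synsets per usable word.
    node_synsets = []
    for word, tag in tags:
        pos = get_wordnet_pos(tag)
        if not pos:
            continue
        node_synsets.append(wordnet.synsets(word, pos=pos))

    max_score = 0
    max_synset = None
    max_combination = None
    for combination in itertools.product(*node_synsets):
        for test in itertools.combinations(combination, 2):
            score = wordnet.path_similarity(test[0], test[1])
            # path_similarity returns None when no path connects the synsets.
            if score is not None and score > max_score:
                max_score = score
                max_combination = test
                # Note: lowest_common_hypernyms returns a list of synsets.
                max_synset = test[0].lowest_common_hypernyms(test[1])
    return max_synset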

However this doesn't work very well plus it is very costly. Are there any ways to figure out which synset best represents multiple words together?

Thanks for your help!

Recommended answer

Apart from what I said in the comments already, I think the way you select the best hyperonym might be flawed. The synset you end up with is not the lowest common hyperonym of all words, but only that of two of them.

Let's stick with your example of "school & office supplies". For each word in the expression you get a number of synsets. So the variable node_synsets will look something like the following:

[[school_1, school_2], [office_1, office_2, office_3], [supply_1]]

In this example, there are 2 × 3 × 1 = 6 ways to pick one synset per word:

[(school_1, office_1, supply_1),
 (school_1, office_2, supply_1),
 (school_1, office_3, supply_1),
 (school_2, office_1, supply_1),
 (school_2, office_2, supply_1),
 (school_2, office_3, supply_1)]
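
To see how quickly this grows, here is a small sketch that reproduces the list above. The strings are placeholders standing in for real Synset objects (an assumption, so that it runs without the WordNet corpus):

```python
import itertools
import math

# Placeholder names standing in for real synsets (illustration only).
node_synsets = [
    ["school_1", "school_2"],
    ["office_1", "office_2", "office_3"],
    ["supply_1"],
]

# One synset per word, in every possible way.
combinations = list(itertools.product(*node_synsets))

# The count is the product of the list lengths, so it grows
# multiplicatively with every extra word or extra sense.
expected_count = math.prod(len(synsets) for synsets in node_synsets)

print(len(combinations), expected_count)
```

Common words often have many more senses than in this toy list, which is probably a large part of why the approach feels costly.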

These triples are what you iterate over in the outer for loop (with itertools.product). If the expression has 4 words, you would iterate over quadruples, with 5 it's quintuples, etc.

Now, with the inner for loop, you split each triple into pairs. For the first triple, these are:

[(school_1, office_1),
 (school_1, supply_1),
 (office_1, supply_1)]

... and you determine the lowest common hyperonym of each pair, keeping only the one from the pair with the highest path similarity. So in the end you get the lowest common hyperonym of, say, school_2 and office_1, which might be some kind of institution. This is probably not very meaningful, as it doesn't consider any synset of the last word.
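
A short sketch of the pairing step makes the problem visible. Again the strings are placeholders rather than real synsets:

```python
import itertools

# Placeholder names standing in for real synsets (illustration only).
triple = ("school_1", "office_1", "supply_1")

# Every pair drawn from a triple leaves exactly one word out.
pairs = list(itertools.combinations(triple, 2))
for pair in pairs:
    left_out = [name for name in triple if name not in pair]
    print(pair, "ignores", left_out)
```

Whichever pair wins, the hyperonym you return was computed without looking at the remaining word.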

Maybe you should try to find the lowest common hyperonym of all three words, in each combination of their synsets, and take the one scoring best among them.
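
Here is a rough sketch of that idea, not a tested solution for your data. To keep it runnable without the WordNet corpus, it works on a made-up toy taxonomy (`TOY_HYPERNYMS`, `toy_parents` and all node names are hypothetical), and it assumes "scoring best" means "deepest in the hierarchy", which is only one possible choice:

```python
import itertools

# Made-up taxonomy standing in for WordNet: node -> direct hypernyms.
TOY_HYPERNYMS = {
    "entity": [],
    "institution": ["entity"],
    "artifact": ["entity"],
    "school_1": ["institution"],
    "school_2": ["artifact"],
    "office_1": ["artifact"],
    "office_2": ["institution"],
    "office_3": ["entity"],
    "supply_1": ["artifact"],
}


def toy_parents(node):
    """Direct hypernyms of a node in the toy taxonomy."""
    return TOY_HYPERNYMS[node]


def ancestors(node, parents):
    """Return the node itself plus every hypernym reachable from it."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in parents(stack.pop()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(node, parents):
    """Length of the longest hypernym path from the node up to a root."""
    return max((depth(p, parents) + 1 for p in parents(node)), default=0)


def best_common_hypernym(node_synsets, parents):
    """Over all one-synset-per-word combinations, return the deepest
    hypernym shared by every synset in a combination (None if there is none)."""
    best, best_depth = None, -1
    for combination in itertools.product(*node_synsets):
        if not combination:
            continue
        # Hypernyms common to ALL words, not just to one pair.
        common = set.intersection(*(ancestors(s, parents) for s in combination))
        for candidate in common:
            candidate_depth = depth(candidate, parents)
            if candidate_depth > best_depth:
                best, best_depth = candidate, candidate_depth
    return best


node_synsets = [
    ["school_1", "school_2"],
    ["office_1", "office_2", "office_3"],
    ["supply_1"],
]
print(best_common_hypernym(node_synsets, toy_parents))
```

With real NLTK synsets, `parents` could plausibly be `lambda s: s.hypernyms() + s.instance_hypernyms()`, and `Synset.max_depth()` could replace the hand-written `depth`; I have not run that wiring against your categories. Ties between equally deep candidates are broken arbitrarily here.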
