NgramCollocationFinder in NLTK

Question

I have a list of n-gram terms and I want to rank them using the tests provided by the NLTK toolkit. But nltk.collocations only offers BigramCollocationFinder, TrigramCollocationFinder and QuadgramCollocationFinder. What can I do if my list of terms contains 5-grams or 6-grams?

Recommended answer

To implement an NgramCollocationFinder you need to get rid of the multitude of i & x variables used in the existing finder classes. The key observation is that these variables follow a pattern: together they cover all combinations of a list of n items. The next step is to replace the variables with a dictionary that uses these combinations as keys.

Finally, you need some logic that updates each combination from the given w# variables whenever the corresponding index is present in the combination. It can be done, but I suggest starting with n=3 or n=4, where you can verify your logic against the existing classes. Once those are correct, you can use it for larger n.
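
As a rough illustration of the dictionary idea, here is a minimal sketch. It is not NLTK's implementation: the function name count_patterns is hypothetical, positions are numbered from 1, and the windowing is simplified (no padding at the ends of the text), so the counts may differ from what the existing finder classes compute.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_patterns(words, n):
    """Count word tuples for every non-empty combination of window positions.

    Hypothetical helper, not part of NLTK. The result maps an index tuple
    such as (1, 3) to a Counter of the words seen at those positions.
    """
    # One key per combination of the positions 1..n, replacing the
    # individual i & x variables with a single dictionary.
    index_sets = [
        idx
        for size in range(1, n + 1)
        for idx in combinations(range(1, n + 1), size)
    ]
    counts = defaultdict(Counter)
    # Slide a window of n words over the text (simplified: no padding).
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        for idx in index_sets:
            key = tuple(window[i - 1] for i in idx)
            counts[idx][key] += 1
    return counts


counts = count_patterns("a b c a b d".split(), 3)
print(counts[(1, 2)][("a", "b")])  # "a b" followed by any word
```

Checking such a sketch against TrigramCollocationFinder on a small text is a good way to find out where the simplified windowing needs to change.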

The recipes section of the itertools documentation contains a powerset() generator that you can use to produce the combinations (see note 1).

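from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    # Chain together the combinations of every size from 0 to len(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))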

Here the (1,2) tuple corresponds to the iix variable, and the (1,3) tuple corresponds to the ixi variable. So, based on the tuple length and on which indexes are present, it is possible to replace all the i & x variables.
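
To make that correspondence concrete, the small sketch below turns an index tuple back into its i & x name. The helper pattern_name is hypothetical and only meant for checking your own bookkeeping; indexes are numbered from 1, as in the powerset() docstring.

```python
from itertools import combinations


def pattern_name(indexes, n):
    """Return the i & x name for a tuple of 1-based indexes.

    Hypothetical helper: 'i' marks a position that is present in the
    tuple, 'x' marks a wildcard position.
    """
    present = set(indexes)
    return "".join("i" if pos in present else "x" for pos in range(1, n + 1))


print(pattern_name((1, 2), 3))  # iix
print(pattern_name((1, 3), 3))  # ixi

# All non-empty patterns for n=3, in the order combinations() yields them.
names = [
    pattern_name(idx, 3)
    for size in range(1, 4)
    for idx in combinations((1, 2, 3), size)
]
print(names)  # ['ixx', 'xix', 'xxi', 'iix', 'ixi', 'xii', 'iii']
```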

Another tool you need is the ability to add to tuples, which is needed to extend or replace the arguments within score_ngram(). Here is a really simple example of how to add to a tuple:

a = (1, 2)
b = a + (3, )    # Notice the trailing comma to make it one element tuple
# b is now (1, 2, 3)
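
Tuple addition can then be used to assemble the arguments for a scoring function piece by piece. The sketch below is an assumption-laden illustration, not NLTK code: build_score_args is a hypothetical name, counts is assumed to map each 1-based index tuple to the count for one particular n-gram, and the layout (full count, then groups of decreasing size, then the total) and the order within each group are assumptions that you should verify against the existing classes for n=3 and n=4.

```python
from itertools import combinations


def build_score_args(counts, n, n_all):
    """Assemble a score-function argument tuple by tuple addition.

    Hypothetical helper. `counts` maps index tuples such as (1, 3) to
    the count for one specific n-gram; `n_all` is the total count.
    """
    args = ()
    for size in range(n, 0, -1):
        group = tuple(
            counts[idx] for idx in combinations(range(1, n + 1), size)
        )
        if size == n:
            # The full n-gram count is added as a bare element.
            args = args + group
        else:
            # Smaller combinations are added as one nested tuple per size.
            args = args + (group,)
    return args + (n_all,)


counts = {
    (1, 2, 3): 5,
    (1, 2): 7, (1, 3): 6, (2, 3): 8,
    (1,): 20, (2,): 30, (3,): 40,
}
print(build_score_args(counts, 3, 1000))
# (5, (7, 6, 8), (20, 30, 40), 1000)
```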

The rest, as they say, is left for you to implement. For help with the sections you need to analyze, see my answer to the related question "Transform QuadgramCollationFinder into PentagramCollationFinder".

1 Thanks to Cyphase for describing this in another answer.
