Gensim Phrases usage to filter n-grams


Question

I am using Gensim Phrases to identify important n-grams in my text as follows.

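from gensim.models import Phrases

# documents: an iterable of tokenized sentences (lists of str)
bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)

for sent in documents:
    bigrams_ = bigram[sent]            # sentence with detected bigrams joined
    trigrams_ = trigram[bigram[sent]]  # second pass applied on top of the bigram output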

However, this detects uninteresting n-grams such as "special issue", "important matter", "high risk", etc. I am particularly interested in detecting concepts in the text, such as "machine learning", "human computer interaction", etc.

Is there a way to stop Phrases from detecting uninteresting n-grams like the ones in my example?

Solution

Phrases has a configurable threshold parameter which adjusts the statistical cutoff for promoting word-pairs into phrases. (Larger thresholds mean fewer pairs become phrases.)

You can adjust that to try to make a greater proportion of its promoted phrases match your own ad hoc intuition about "interesting" phrases – but this class is still using a fairly crude method, without any awareness of grammar or domain knowledge beyond what's in the corpus. So any value that gets all/most of the phrases you want will likely include many uninteresting ones, or vice-versa.
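To make the "crude method" point concrete, here is a minimal pure-Python sketch of the kind of collocation score that, as far as I understand it, Phrases uses by default: (count(a, b) - min_count) * vocab_size / (count(a) * count(b)), with a pair promoted when the score exceeds threshold. This is not gensim's code; the helper names score_pairs and promoted and the toy sentences are hypothetical, and the real implementation differs in details.

```python
from collections import Counter


def score_pairs(sentences, min_count=1):
    """Score adjacent word pairs with a simple count-based collocation score."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # too rare to be considered at all
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
    return scores


def promoted(scores, threshold):
    """Return the pairs whose score is above the cutoff."""
    return {pair for pair, score in scores.items() if score > threshold}


# Hypothetical toy corpus, only for illustration.
sentences = [
    ["machine", "learning", "is", "a", "high", "risk", "field"],
    ["machine", "learning", "needs", "data"],
    ["machine", "learning", "is", "fun"],
    ["high", "risk", "is", "a", "special", "issue"],
    ["a", "high", "bar", "is", "a", "risk"],
]

scores = score_pairs(sentences, min_count=1)
for cutoff in (1.0, 1.5, 2.0):
    print(cutoff, sorted(promoted(scores, cutoff)))
```

On this toy data a higher cutoff leaves fewer pairs, but at the middle cutoff the pair ("is", "a") still outranks ("high", "risk"): the score only sees counts, not whether a pair is a meaningful concept.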

If you have a priori knowledge that certain word-groups are of importance, you could preprocess the corpus yourself to combine those into single tokens, before (or instead of) the collocation-statistics-based Phrases process.
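The preprocessing idea can be sketched as follows, assuming you already have a hand-curated set of important word-groups. The function merge_known_phrases, the known_phrases set, and the underscore delimiter are illustrative choices, not part of gensim; the underscore is used only so that the merged tokens look like the joined tokens Phrases usually produces.

```python
def merge_known_phrases(tokens, known_phrases, delimiter="_"):
    """Greedily join the longest known phrase starting at each position."""
    max_len = max((len(p) for p in known_phrases), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word phrases.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in known_phrases:
                merged.append(delimiter.join(candidate))
                i += n
                break
        else:
            # No known phrase starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical domain list; in practice this comes from your own knowledge.
known_phrases = {
    ("machine", "learning"),
    ("human", "computer", "interaction"),
}

sentence = ["human", "computer", "interaction", "and", "machine",
            "learning", "are", "a", "special", "issue"]
merged = merge_known_phrases(sentence, known_phrases)
print(merged)
```

Running every sentence through such a step before (or instead of) Phrases guarantees that the listed concepts become single tokens, while pairs like "special issue" are left alone unless the statistical pass later promotes them.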
