Gensim Phrases usage to filter n-grams
Question
I am using Gensim Phrases to identify important n-grams in my text as follows.
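    from gensim.models.phrases import Phrases

    # documents: an iterable of tokenized sentences (lists of str)
    bigram = Phrases(documents, min_count=5)
    trigram = Phrases(bigram[documents], min_count=5)

    for sent in documents:
        bigrams_ = bigram[sent]
        trigrams_ = trigram[bigram[sent]]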
However, this detects uninteresting n-grams such as special issue, important matter, high risk, etc. I am particularly interested in detecting concepts in the text such as machine learning, human computer interaction, etc.
Is there a way to stop Phrases from detecting uninteresting n-grams like the ones in my example above?
Answer
Phrases has a configurable threshold parameter which adjusts the statistical cutoff for promoting word-pairs into phrases. (Larger thresholds mean fewer pairs become phrases.)
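To make the effect of threshold concrete, here is a minimal pure-Python sketch of the scoring formula that gensim documents for its default scorer (bigram count minus min_count, divided by both word counts, times vocabulary size). The counts, LEN_VOCAB, and the helper names default_score and promoted are made up for illustration; they are not part of gensim, and real scores depend on your corpus.

```python
def default_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Sketch of the formula documented for gensim's default phrase scorer."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab


# Hypothetical corpus statistics, chosen only to illustrate the cutoff.
LEN_VOCAB = 10000
MIN_COUNT = 5
candidates = {
    # Both words are rare and almost always occur together.
    "machine_learning": dict(worda_count=40, wordb_count=50, bigram_count=35),
    # Both words are common on their own, so the pair is less surprising.
    "high_risk": dict(worda_count=400, wordb_count=300, bigram_count=35),
}


def promoted(threshold):
    """Return the candidate pairs whose score exceeds the threshold."""
    return sorted(
        name
        for name, counts in candidates.items()
        if default_score(len_vocab=LEN_VOCAB, min_count=MIN_COUNT, **counts) > threshold
    )


for threshold in (1, 10, 200):
    print(threshold, promoted(threshold))
```

With these invented counts, a low threshold promotes both pairs, a moderate one keeps only the pair of rarer words, and a very high one promotes nothing. The score only sees co-occurrence counts, so a pair like special issue can still outscore a concept you care about.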
You can adjust that to try to make a greater proportion of its promoted phrases match your own ad hoc intuition about "interesting" phrases – but this class is still using a fairly crude method, without any awareness of grammar or domain knowledge beyond what's in the corpus. So any value that gets all/most of the phrases you want will likely include many uninteresting ones, or vice-versa.
If you have a priori knowledge that certain word-groups are of importance, you could preprocess the corpus yourself to combine those into single tokens, before (or instead of) the collocation-statistics-based Phrases process.
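One way to do that preprocessing is sketched below in plain Python, assuming the corpus is already tokenized and that you keep your own list of known multi-word concepts. The function merge_known_phrases, the known_phrases set, and the underscore delimiter are illustrative choices, not part of gensim.

```python
def merge_known_phrases(tokens, known_phrases, delimiter="_"):
    """Join known multi-word phrases into single tokens (longest match first).

    tokens: list of str for one sentence.
    known_phrases: set of tuples of str, e.g. {("machine", "learning")}.
    """
    max_len = max((len(p) for p in known_phrases), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in known_phrases:
                merged.append(delimiter.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No known phrase starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical domain list and sentence.
known_phrases = {
    ("machine", "learning"),
    ("human", "computer", "interaction"),
}
sentence = ["we", "study", "human", "computer", "interaction",
            "and", "machine", "learning"]
print(merge_known_phrases(sentence, known_phrases))
```

The merged sentences can then be used directly, or passed on to Phrases if you still want statistical detection for pairs that are not on your list.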