Gensim Phrases usage to filter n-grams

Views: 607 Posted: 2020/5/18 1:05:15 Tags: python nlp word2vec gensim


Question

I am using Gensim Phrases to identify important n-grams in my text as follows.

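from gensim.models.phrases import Phrases

# documents: an iterable of tokenized sentences (each a list of str)
bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)

# Apply the bigram model first, then the trigram model on top of it
for sent in documents:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]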

However, this detects uninteresting n-grams such as "special issue", "important matter", "high risk", etc. I am particularly interested in detecting concepts in the text, such as "machine learning", "human computer interaction", etc.

Is there a way to stop Phrases from detecting uninteresting n-grams like the ones in my example above?

Solution

Phrases has a configurable threshold parameter which adjusts the statistical cutoff for promoting word-pairs into phrases. (Larger thresholds mean fewer pairs become phrases.)
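
To make the effect of the cutoff concrete, here is a minimal pure-Python sketch. It assumes the default scoring rule is the one from Mikolov et al., (pair_count - min_count) / (count_a * count_b) * vocab_size, and that a pair is promoted when its score exceeds the threshold. All counts below are invented for illustration, and the helper names are hypothetical, not part of gensim.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Assumed default score for the word pair (a, b)."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


# Hypothetical corpus statistics: (count_a, count_b, count_ab)
VOCAB_SIZE = 10_000
PAIR_COUNTS = {
    "machine learning": (60, 80, 50),
    "special issue": (300, 200, 125),
    "high risk": (900, 400, 40),
}


def promoted(threshold):
    """Return the pairs whose score exceeds the threshold, in sorted order."""
    return sorted(
        pair
        for pair, (a, b, ab) in PAIR_COUNTS.items()
        if phrase_score(a, b, ab, VOCAB_SIZE) > threshold
    )


print(promoted(10))  # a lower cutoff lets more pairs through
print(promoted(50))  # a higher cutoff keeps only the strongest collocations
```

In gensim itself the same knob is passed to the constructor, for example Phrases(documents, min_count=5, threshold=50); the value 50 is only an example and has to be tuned on your own corpus.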

You can adjust that to try to make a greater proportion of its promoted phrases match your own ad hoc intuition about "interesting" phrases – but this class is still using a fairly crude method, without any awareness of grammar or domain knowledge beyond what's in the corpus. So any value that gets all/most of the phrases you want will likely include many uninteresting ones, or vice-versa.
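
One way to see this trade-off is to hand-label a small set of phrases you care about and check how each candidate threshold performs against it. The sketch below is only an illustration: the scores and the "wanted" set are made up, and evaluate is a hypothetical helper, not a gensim function.

```python
# Hypothetical phrase scores, as if produced by a collocation scorer
SCORES = {
    "machine_learning": 93.8,
    "human_computer": 41.0,
    "special_issue": 20.0,
    "important_matter": 12.5,
    "neural_network": 8.0,
    "high_risk": 1.0,
}

# Hand-labeled phrases that match our own notion of "interesting"
WANTED = {"machine_learning", "human_computer", "neural_network"}


def evaluate(threshold):
    """Return (precision, recall) of the promoted phrases against WANTED."""
    promoted = {p for p, s in SCORES.items() if s > threshold}
    hits = promoted & WANTED
    precision = len(hits) / len(promoted) if promoted else 0.0
    recall = len(hits) / len(WANTED)
    return precision, recall


for threshold in (5, 10, 30):
    precision, recall = evaluate(threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

With these invented numbers, a low threshold recovers every wanted phrase but also lets uninteresting ones through, while a high threshold removes the noise at the cost of dropping a wanted phrase.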

If you have a priori knowledge that certain word-groups are of importance, you could preprocess the corpus yourself to combine those into single tokens, before (or instead of) the collocation-statistics-based Phrases process.
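
A minimal sketch of that preprocessing step, assuming you maintain your own list of known phrases and that the corpus is already tokenized. The phrase list, the "_" delimiter, and the function name merge_known_phrases are assumptions made for illustration.

```python
# Known multi-word concepts, each given as a tuple of tokens (assumed list)
KNOWN_PHRASES = {
    ("machine", "learning"),
    ("human", "computer", "interaction"),
}


def merge_known_phrases(tokens, phrases=KNOWN_PHRASES, delimiter="_"):
    """Join known phrases into single tokens, preferring the longest match."""
    max_len = max((len(p) for p in phrases), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase starting at position i first
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                merged.append(delimiter.join(candidate))
                i += n
                break
        else:
            # No known phrase starts here; keep the token unchanged
            merged.append(tokens[i])
            i += 1
    return merged


sentence = ["we", "study", "human", "computer", "interaction",
            "and", "machine", "learning"]
print(merge_known_phrases(sentence))
```

The merged sentences can then be fed to Phrases (or directly to a downstream model such as Word2Vec), so that the known concepts are already single tokens regardless of their collocation statistics.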
