Manually add collocations to gensim phraser


Question

I am doing topic modelling on linguistics papers and using Gensim's Phrases to identify frequent collocations. I want terms such as 'do-support' and 'it-clefts' to be marked as single words, since they are specific linguistic terminology. However, if I build the Phrases model after removing stopwords, these collocations are not found (since they contain stopwords); if I build it without removing stopwords (or after removing only stopwords other than 'it' and 'do'), it identifies a whole lot of irrelevant collocations. Is there a way to manually add phrases that Gensim's Phrases should recognise as collocations? Thanks!

Recommended answer

The Phrases class doesn't have the ability to add desired bigrams. Its technique generally does not expect 'stop words' to have been removed before processing.

You could potentially tune Phrases behavior by trying different 'threshold' and 'min_count' values.
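To make the effect of those two parameters concrete, here is a minimal pure-Python sketch of the default bigram score as described in Gensim's documentation (the Mikolov et al. formula): a pair is promoted when its score exceeds 'threshold'. The function names and all counts below are made up for illustration; they are not taken from a real corpus or from Gensim's code.

```python
def original_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Default-style bigram score: (count_ab - min_count) / count_a / count_b * vocab_size."""
    return (count_ab - min_count) / count_a / count_b * vocab_size


def is_promoted(score, threshold=10.0):
    """A bigram is joined into a phrase only if its score exceeds the threshold."""
    return score > threshold


# Hypothetical counts: the pair occurs 10 times in a 5000-word vocabulary.
# Case 1: both words are fairly rare, so the pair scores well.
rare_pair = original_score(count_a=50, count_b=12, count_ab=10,
                           vocab_size=5000, min_count=5)

# Case 2: the first word is a very frequent stopword such as "do",
# which drags the score far below a threshold of 10.
stopword_pair = original_score(count_a=2000, count_b=12, count_ab=10,
                               vocab_size=5000, min_count=5)

print(rare_pair, is_promoted(rare_pair))
print(stopword_pair, is_promoted(stopword_pair))
# Lowering the threshold lets the stopword pair through, along with
# every other pair that scores in the same range.
print(is_promoted(stopword_pair, threshold=1.0))
```

This is why phrases built on a stopword tend to need a low threshold, and why lowering it also admits many unrelated pairs.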

If you find some settings are connecting desired-phrases, but then also some unwanted phrases that still fit the same statistical thresholds, maybe that's not a great harm, despite the non-intuitiveness of some of the phrases. All these statistical techniques are imprecise, and often best judged by their end results on quantitative goals – rather than any arbitrary oddities/corner-cases found from an ad-hoc review.

If you did want to dig into the code to add the ability to force certain bigrams, it might be easier via the Phraser utility class, also in gensim's phrases.py module. At the cost of some extra up-front calculation, it reduces the Phrases data to a smaller structure, with just the bigrams that would later pass the combination-threshold. As such, it saves a bit of memory, and performs later corpus-transformations a little faster, but if you only keep the Phraser, you lose the ability to try other thresholds/min_counts below what was used in its creation. But you could potentially force extra hand-chosen bigrams into its structures, after creation, more easily than tampering with the full Phrases model.

Update (April 2021): Starting in Gensim-4.0, the Phraser class has been renamed FrozenPhrases, for better distinction from the training Phrases class. Additionally, a suggestion in a project issue provides a probably-effective way to 'force' certain bigram-phrases to always be promoted. Specifically:

from gensim.models.phrases import Phrases

phrases = Phrases(…)  # do customary training/etc
frozen_phrases = phrases.freeze()  # freeze bigrams' scores for compactness/efficiency
frozen_phrases.phrasegrams['my_phrase'] = float('inf')  # set the desired phrase to infinite score
