Error in extracting phrases using Gensim

Question

I am trying to get the bigrams in the sentences using Phrases in Gensim as follows.

from gensim.models import Phrases
from gensim.models.phrases import Phraser

# Three toy training sentences
documents = ["the mayor of new york was there", "machine learning can be useful sometimes", "new york mayor was present"]

sentence_stream = [doc.split(" ") for doc in documents]
#print(sentence_stream)
# Note: a bytes delimiter (b' ') is the gensim 3.x convention; gensim 4.x expects a str (' ')
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)

for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)

Even though it catches "new", "york" as "new york", it does not catch "machine", "learning" as "machine learning".

However, in the example shown on the Gensim website they were able to catch the words "machine", "learning" as "machine learning".

Please let me know how to get "machine learning" as a bigram in the above example.

Answer

The technique used by gensim Phrases is purely based on statistics of co-occurrences: how often words appear together, versus alone, in a formula also affected by min_count and compared against the threshold value.
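For reference, gensim's default scoring function (the "original scorer" from Mikolov et al.'s word2vec paper) is roughly (pair_count - min_count) / (count_a * count_b) * vocab_size. Here is a minimal sketch of that arithmetic on the question's corpus; vocab_size is simplified to the 14 distinct unigrams (gensim's internal vocabulary also counts candidate bigrams, which scales both scores equally, so the comparison is unaffected):

def original_score(pair_count, count_a, count_b, vocab_size, min_count):
    # gensim's default scorer: (pair - min_count) / (a * b) * vocab
    return (pair_count - min_count) / (count_a * count_b) * vocab_size

# 'new' and 'york' each occur twice, always together;
# 'machine' and 'learning' each occur once, together once.
print(original_score(2, 2, 2, 14, min_count=1))  # 3.5 -> clears threshold=2, joined
print(original_score(1, 1, 1, 14, min_count=1))  # 0.0 -> below threshold=2, not joined

Because 'machine' and 'learning' co-occur exactly min_count times, their score is exactly zero no matter how large the vocabulary is.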

It is only because your training set has 'new' and 'york' occur alongside each other twice, while other words (like 'machine' and 'learning') only occur alongside each other once, that 'new_york' becomes a bigram, and other pairings do not. What's more, even if you did find a combination of min_count and threshold that would promote 'machine_learning' to a bigram, it would also pair together every other bigram-that-appears-once – which is probably not what you want.
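To see that concretely: with the default scorer, a pair seen exactly min_count times scores 0, so admitting 'machine learning' here requires a negative threshold, and then every adjacent pair that survives min_count clears the bar too. A sketch (the exact merges depend on the gensim version's greedy left-to-right pairing):

# WARNING: a threshold low enough to admit 'machine learning' admits everything
loose = Phraser(Phrases(sentence_stream, min_count=1, threshold=-1, delimiter=b' '))
for sent in sentence_stream:
    print(loose[sent])
# expect joins like 'the mayor' and 'of new' as well -- and the greedy
# 'of new' merge can even pre-empt the 'new york' pair you wanted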

Really, to get good results from these statistical techniques, you need lots of varied, realistic data. (Toy-sized examples may superficially succeed, or fail, for superficial toy-sized reasons.)

Even then, they will tend to miss combinations a person would consider reasonable, and make combinations a person wouldn't. Why? Because our minds have much more sophisticated ways (including grammar and real-world knowledge) for deciding when clumps of words represent a single concept.

So even with more and better data, be prepared for nonsensical n-grams. Tune or judge the model on whether it is overall improving on your goal, not on any single point or ad-hoc check of matching your own sensibility.

(Regarding the referenced gensim documentation comment, I'm pretty sure that if you try Phrases on just the two sentences listed there, it won't find any of the desired phrases – not 'new_york' or 'machine_learning'. As a figurative example, the ellipses ... imply the training set is larger, and the results indicate that the extra unshown texts are important. It's just because of the 3rd sentence you've added to your code that 'new_york' is detected. If you added similar examples to make 'machine_learning' look more like a statistically-outlying pairing, your code could promote 'machine_learning', too.)
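For instance, extending the question's corpus with a couple of hypothetical extra sentences (illustrative additions, not the documentation's actual training set) raises the pair's count above min_count, and the original settings then join it:

more_documents = documents + [
    "machine learning is fun",          # hypothetical additions for illustration
    "we use machine learning daily",
]
more_stream = [doc.split(" ") for doc in more_documents]
bigram2 = Phraser(Phrases(more_stream, min_count=1, threshold=2, delimiter=b' '))
print(bigram2["machine learning can be useful sometimes".split(" ")])
# 'machine' and 'learning' now co-occur 3 times (vs. 3 occurrences each),
# so their score clears threshold=2 and they are joined, like 'new york'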
