The similar method from the nltk module produces different results on different machines. Why?


Problem description

I have taught a few introductory classes on text mining with Python, and the class tried the similar method with the provided practice texts. Some students got different results for text1.similar() than others.

All versions, etc., were the same.

Does anyone know why these differences would occur? Thanks.

Code used at the command line:

python
>>> import nltk
>>> nltk.download() #here you use the pop-up window to download texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet

The lists of terms returned by the similar method differ from user to user. They have many words in common, but they are not identical. All users were using the same OS and the same versions of Python and nltk.

I hope that makes the question clearer. Thanks.

Recommended answer

In your example there are 40 other words that share exactly one context with the word 'monstrous'. The similar function uses a Counter object to count the words with shared contexts and then prints the most common ones (20 by default). Since all 40 have the same count, the order can differ, and so can the 20 that make the cut.

From the Counter.most_common documentation:

Elements with equal counts are ordered arbitrarily
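A minimal sketch of the tie problem, using only the standard library. The words are borrowed from the output in the question, but the two counters are made up for illustration, and stable_most_common is a hypothetical helper, not part of nltk or collections. It breaks ties alphabetically so that the top-n list no longer depends on the order in which elements were counted.

```python
from collections import Counter


def stable_most_common(counter, n):
    """Return the n highest-count items, breaking ties alphabetically.

    Hypothetical helper for illustration; not part of nltk or collections.
    """
    # Sort by descending count first, then by the word itself.
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Two counters with identical contents, filled in a different order.
a = Counter(["mean", "part", "doleful", "careful"])
b = Counter(["careful", "doleful", "part", "mean"])

# Every word has count 1, so most_common(2) has no principled way to pick two.
print(a.most_common(2))
print(b.most_common(2))

# With an explicit tie-break both counters give the same answer.
print(stable_most_common(a, 2))
print(stable_most_common(b, 2))
```

The sort key is the only design choice here: any total order over the words would do, alphabetical is just the easiest one to explain to a class.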


I checked the frequency of the similar words with this code (which is essentially a copy of the relevant part of the function's code):

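# Standard-library Counter; the original used nltk.compat.Counter,
# which may not be available in newer NLTK releases.
from collections import Counter

from nltk.book import *
from nltk.util import tokenwrap

word = 'monstrous'
num = 20

# Calling similar() first builds text1._word_context_index as a side effect.
text1.similar(word)

# Private attribute: maps each word to the contexts it appears in.
wci = text1._word_context_index._word_to_contexts

fd = Counter()  # stays empty if the word is not in the index
if word in wci.conditions():
    contexts = set(wci[word])
    # For every other word, count how many contexts it shares with `word`.
    fd = Counter(w for w in wci.conditions() for c in wci[w]
                 if c in contexts and not w == word)
    words = [w for w, _ in fd.most_common(num)]
    # print(tokenwrap(words))

print(fd)
print(len(fd))
print(fd.most_common(num))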

Output (different runs give different output for me):

Counter({'doleful': 1, 'curious': 1, 'delightfully': 1, 'careful': 1, 'uncommon': 1, 'mean': 1, 'perilous': 1, 'fearless': 1, 'imperial': 1, 'christian': 1, 'trustworthy': 1, 'untoward': 1, 'maddens': 1, 'true': 1, 'contemptible': 1, 'subtly': 1, 'wise': 1, 'lamentable': 1, 'tyrannical': 1, 'puzzled': 1, 'vexatious': 1, 'part': 1, 'gamesome': 1, 'determined': 1, 'reliable': 1, 'lazy': 1, 'passing': 1, 'modifies': 1, 'few': 1, 'horrible': 1, 'candid': 1, 'exasperate': 1, 'pitiable': 1, 'abundant': 1, 'mystifying': 1, 'mouldy': 1, 'loving': 1, 'domineering': 1, 'impalpable': 1, 'singular': 1})
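One way to get the same list on every machine would be to rank the candidates with an explicit tie-break. The sketch below mirrors the counting step above on a plain dict instead of nltk's internal index, so it runs without the corpus. The names word_to_contexts and similar_words are hypothetical, and the contexts are invented toy data rather than anything taken from Moby Dick.

```python
from collections import Counter


def similar_words(word_to_contexts, word, num=20):
    """Rank other words by the number of contexts they share with `word`.

    Ties are broken alphabetically so the result is reproducible.
    Hypothetical helper; nltk's Text.similar does not take these arguments.
    """
    if word not in word_to_contexts:
        return []
    contexts = set(word_to_contexts[word])
    # Count one hit per shared context for every word other than `word`.
    shared = Counter(
        w
        for w, ctxs in word_to_contexts.items()
        for c in ctxs
        if c in contexts and w != word
    )
    # Descending count, then alphabetical order within equal counts.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:num]]


# Invented toy data: each context is a (left word, right word) pair.
word_to_contexts = {
    "monstrous": [("a", "size"), ("most", "and")],
    "mean": [("a", "size")],
    "careful": [("most", "and")],
    "vast": [("a", "size"), ("most", "and")],
    "whale": [("the", "is")],
}

print(similar_words(word_to_contexts, "monstrous"))
```

With real NLTK data, the same ranking step could presumably replace the fd.most_common(num) call in the snippet above, at the cost of relying on a private attribute.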
