The similar method from the nltk module produces different results on different machines. Why?

Question

I have taught a few introductory classes on text mining with Python, and the class tried the similar method on the provided practice texts. Some students got different results for text1.similar() than others.

All versions and so on were the same.

Does anyone know why these differences would occur? Thanks.

Code used at the command line:

python
>>> import nltk
>>> nltk.download() #here you use the pop-up window to download texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1.similar("monstrous")
mean part maddens doleful gamesome subtly uncommon careful untoward
exasperate loving passing mouldy christian few true mystifying
imperial modifies contemptible
>>> text2.similar("monstrous")
very heartily so exceedingly remarkably as vast a great amazingly
extremely good sweet

The lists of terms returned by the similar method differ from user to user. They have many words in common, but they are not identical. All users were on the same OS, with the same versions of Python and nltk.

I hope that makes the question clearer. Thanks.

Recommended answer

In your example there are 40 other words that have exactly one context in common with the word 'monstrous'. The similar function uses a Counter object to count, for each word, the contexts it shares with the target word, and then prints the most common words (20 by default). Since all 40 have the same count, the order in which they come out can differ, and so can the 20 that get printed.
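To make "one context in common" concrete, here is a toy sketch of the idea. It is not NLTK's implementation: the function name shared_context_counts, the '*START*'/'*END*' padding tokens, and the sample sentence are all assumptions made up for illustration. A context here is simply the pair (previous word, next word).

```python
from collections import Counter, defaultdict


def shared_context_counts(tokens, word):
    """For every other word, count the distinct (left, right) contexts
    it shares with `word`. Hypothetical helper, not part of NLTK."""
    contexts_of = defaultdict(set)
    # Pad so that the first and last tokens also get a context.
    padded = ["*START*"] + [t.lower() for t in tokens] + ["*END*"]
    for left, mid, right in zip(padded, padded[1:], padded[2:]):
        contexts_of[mid].add((left, right))

    target = contexts_of.get(word, set())
    return Counter({
        w: len(ctxs & target)
        for w, ctxs in contexts_of.items()
        if w != word and ctxs & target
    })


tokens = ("a monstrous size a curious size a monstrous whale "
          "a true whale the true size").split()
counts = shared_context_counts(tokens, "monstrous")
print(counts)
```

In this toy text, 'curious' and 'true' each share exactly one context with 'monstrous', so they tie with a count of 1. That is the same situation as the 40 tied words in Moby Dick, only smaller.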

From the documentation of Counter.most_common:

Elements with equal counts are ordered arbitrarily
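If a class needs identical lists on every machine, one option is to break ties explicitly rather than rely on most_common. The sketch below is an assumption about what "deterministic" should mean here (ties broken alphabetically); stable_most_common is a hypothetical helper, not something NLTK or collections provides. Note also that more recent Python documentation says tied elements keep the order in which they were first encountered, so how arbitrary the order really is depends on the Python version.

```python
from collections import Counter


def stable_most_common(counter, n=None):
    """Like Counter.most_common, but ties are broken alphabetically,
    so the result does not depend on insertion or hash order."""
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    return ranked if n is None else ranked[:n]


# Same words, inserted in a different order: every count is 1.
counts_a = Counter(["mean", "part", "true", "few"])
counts_b = Counter(["few", "true", "part", "mean"])

top_a = stable_most_common(counts_a, 2)
top_b = stable_most_common(counts_b, 2)
print(top_a)
print(top_a == top_b)
```

Both calls return the same two entries regardless of the order in which the words were counted, which is the property the students' runs were missing.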

I checked the frequency of the similar words with this code (which is essentially a copy of the relevant part of the function code):

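from collections import Counter  # standard-library Counter; avoids depending on nltk.compat

from nltk.book import *
from nltk.util import tokenwrap

word = 'monstrous'
num = 20

text1.similar(word)

# similar() builds this index on first use, so read it only after the call above
wci = text1._word_context_index._word_to_contexts

if word in wci.conditions():
    contexts = set(wci[word])
    # for every other word, count the contexts it shares with `word`
    fd = Counter(w for w in wci.conditions() for c in wci[w]
                 if c in contexts and not w == word)
    words = [w for w, _ in fd.most_common(num)]
    # print(tokenwrap(words))

    # kept inside the if-block so fd is always defined when printed
    print(fd)
    print(len(fd))
    print(fd.most_common(num))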

Output (different runs give me different output):

Counter({'doleful': 1, 'curious': 1, 'delightfully': 1, 'careful': 1, 'uncommon': 1, 'mean': 1, 'perilous': 1, 'fearless': 1, 'imperial': 1, 'christian': 1, 'trustworthy': 1, 'untoward': 1, 'maddens': 1, 'true': 1, 'contemptible': 1, 'subtly': 1, 'wise': 1, 'lamentable': 1, 'tyrannical': 1, 'puzzled': 1, 'vexatious': 1, 'part': 1, 'gamesome': 1, 'determined': 1, 'reliable': 1, 'lazy': 1, 'passing': 1, 'modifies': 1, 'few': 1, 'horrible': 1, 'candid': 1, 'exasperate': 1, 'pitiable': 1, 'abundant': 1, 'mystifying': 1, 'mouldy': 1, 'loving': 1, 'domineering': 1, 'impalpable': 1, 'singular': 1})
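One plausible reason the order changes between runs, which the answer does not state and which is offered here only as an assumption: Python 3.3 and later randomize string hashes per process by default, and the iteration order of sets (and, on older interpreters, dicts) of strings depends on those hashes. The sketch below does not use NLTK; it runs a one-line snippet in fresh interpreters with a fixed PYTHONHASHSEED to show that pinning the seed makes the order repeatable. The snippet, the seed values, and the helper name set_order are illustrative.

```python
import os
import subprocess
import sys

# A set of strings: its iteration order depends on the string hashes.
SNIPPET = "print(list({'mean', 'part', 'true', 'few', 'loving'}))"


def set_order(seed):
    """Run SNIPPET in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


first = set_order(1)
second = set_order(1)
other = set_order(2)

print(first == second)  # same seed, same order
print(first)
print(other)  # a different seed may or may not give a different order
```

Whether this is what affected the students depends on the Python version and on NLTK internals not shown here, so treat it as something to test rather than a diagnosis.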
