Understanding NLTK collocation scoring for bigrams and trigrams


Problem description

Background:

I am trying to compare pairs of words to see which pair is "more likely to occur" in US English than another pair. My plan is/was to use the collocation facilities in NLTK to score word pairs, with the higher-scoring pair being the most likely.

Approach:

I coded the following in Python using NLTK (several steps and imports removed for brevity):

bgm    = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
scored = finder.score_ngrams(bgm.likelihood_ratio)
print(scored)

Results:

I then examined the results using two word pairs, one of which should be highly likely to co-occur and one which should not ("roasted cashews" and "gasoline cashews"). I was surprised to see these word pairings score identically:

[(('roasted', 'cashews'), 5.545177444479562)]
[(('gasoline', 'cashews'), 5.545177444479562)]

I would have expected 'roasted cashews' to score higher than 'gasoline cashews' in my test.
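The tie itself can be reproduced by hand. Assuming a tiny token stream like the one below (a hypothetical reconstruction; the original tokens are not shown), both pairs produce identical 2x2 contingency tables, so any symmetric association measure has to score them equally. Here is a minimal sketch of the standard G² log-likelihood statistic (an illustration, not NLTK's exact internals):

```python
import math
from collections import Counter

def llr(n_ii, n_ix, n_xi, n_xx):
    """G^2 log-likelihood ratio for one bigram.
    n_ii: bigram count, n_ix: count of word 1 in first position,
    n_xi: count of word 2 in second position, n_xx: total bigrams."""
    obs = [n_ii, n_ix - n_ii, n_xi - n_ii, n_xx - n_ix - n_xi + n_ii]
    exp = [n_ix * n_xi / n_xx,
           n_ix * (n_xx - n_xi) / n_xx,
           (n_xx - n_ix) * n_xi / n_xx,
           (n_xx - n_ix) * (n_xx - n_xi) / n_xx]
    # Cells with zero observed count contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

# Hypothetical tiny token stream containing both test pairs.
tokens = ['roasted', 'cashews', 'gasoline', 'cashews']
bigrams = list(zip(tokens, tokens[1:]))
n_xx = len(bigrams)
bg = Counter(bigrams)
w1 = Counter(b[0] for b in bigrams)
w2 = Counter(b[1] for b in bigrams)

def score(pair):
    return llr(bg[pair], w1[pair[0]], w2[pair[1]], n_xx)

print(score(('roasted', 'cashews')), score(('gasoline', 'cashews')))
```

Both pairs occur once, both first words occur once, and 'cashews' occurs twice, out of three bigrams total: the contingency tables are cell-for-cell identical, so no measure based only on these counts can separate the pairs.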

Questions:

  1. Am I misunderstanding the use of collocations?
  2. Is my code incorrect?
  3. Is my assumption that the scores should be different wrong, and if so, why?

Thank you very much for any information or help!

Solution

The NLTK collocations documentation seems pretty good to me: http://www.nltk.org/howto/collocations.html

You need to give the scorer an actual sizable corpus to work with. Here is a working example using the Brown corpus built into NLTK; it takes about 30 seconds to run.

import nltk.collocations
import nltk.corpus
import collections

bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(
    nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)

# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))

# Sort keyed bigrams by strongest association.
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])

print('doctor', prefix_keys['doctor'][:5])
print('baseball', prefix_keys['baseball'][:5])
print('happy', prefix_keys['happy'][:5])

The output seems reasonable: it works well for baseball, less so for doctor and happy.

doctor [('bills', 35.061321987405748), (',', 22.963930079491501), 
  ('annoys', 19.009636692022365), 
  ('had', 16.730384189212423), ('retorted', 15.190847940499127)]

baseball [('game', 32.110754519752291), ('cap', 27.81891372457088), 
  ('park', 23.509042621473505), ('games', 23.105033513054011), 
  ("player's",    16.227872863424668)]

happy [("''", 20.296341424483998), ('Spahn', 13.915820697905589), 
 ('family', 13.734352182441569), 
 (',', 13.55077617193821), ('bodybuilder', 13.513265447290536)]
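The same recipe extends to trigrams via TrigramAssocMeasures and TrigramCollocationFinder. A short sketch on a made-up token list (hypothetical text, chosen so it runs without downloading a corpus); the frequency filter drops n-grams seen only once, which also helps clean up punctuation artifacts like the commas in the bigram output above:

```python
import nltk.collocations

# Hypothetical toy text; any sizable token list works the same way.
tokens = ("we ate roasted cashews and more roasted cashews while "
          "gasoline prices rose and we ate roasted cashews again").split()

tgm = nltk.collocations.TrigramAssocMeasures()
finder = nltk.collocations.TrigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # ignore trigrams that occur only once
scored = finder.score_ngrams(tgm.likelihood_ratio)
print(scored)  # highest-scoring trigrams first
```

On this toy data only the repeated trigrams survive the filter, with 'ate roasted cashews' among them; on a real corpus you would raise the filter threshold as the corpus grows.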
