How to find collocations in text, python
Problem description
How do you find collocations in text? A collocation is a sequence of words that occurs together unusually often. Python has a built-in function bigrams that returns word pairs:
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find bigrams that occur together more often than the frequencies of the individual words would suggest. Any ideas how to put this in code?
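As a first step, the raw bigram frequencies can be counted with the standard library alone. This is a minimal sketch (the helper name `count_bigrams` is illustrative, not from any library):

```python
from collections import Counter

def count_bigrams(words):
    """Count adjacent word pairs in a list of tokens."""
    return Counter(zip(words, words[1:]))

words = "the cat sat on the mat the cat ran".split()
counts = count_bigrams(words)
# ('the', 'cat') is the only pair that occurs twice in this sample
print(counts.most_common(1))  # [(('the', 'cat'), 2)]
```

Raw counts alone favour pairs of common words, which is why the answer below reaches for NLTK's collocation machinery instead.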
Solution
Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
... for sent in nltk.sent_tokenize(sentences.lower()):
... for word in nltk.word_tokenize(sent):
... yield word
...
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list
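Under the hood, collocation finders rank bigrams by an association score rather than by raw counts; pointwise mutual information (PMI) is one common choice. A rough stdlib-only sketch of the idea, assuming a simple PMI over adjacent pairs (the function name `pmi_bigrams` is mine, not NLTK's API):

```python
import math
from collections import Counter

def pmi_bigrams(words, min_count=1):
    """Score adjacent word pairs by pointwise mutual information:
    log2( p(w1, w2) / (p(w1) * p(w2)) ). Higher = stronger collocation."""
    n = len(words)
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, as NLTK's apply_freq_filter does
        p_pair = count / (n - 1)      # there are n - 1 adjacent pairs
        p_w1 = unigrams[w1] / n
        p_w2 = unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

words = "new york is big but new york is far".split()
for pair, score in pmi_bigrams(words)[:3]:
    print(pair, round(score, 2))
```

With real data you would raise `min_count`, since PMI wildly overrates pairs seen only once; NLTK's BigramAssocMeasures offers PMI alongside more robust scores such as the likelihood ratio.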