How to find collocations in text, python


Problem description



How do you find collocations in text? A collocation is a sequence of words that occurs together unusually often. NLTK has a function, bigrams, that returns adjacent word pairs (it is not a Python built-in, and in NLTK 3 it returns a generator, hence the list() call).

>>> from nltk import bigrams
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

What's left is to find the bigrams that occur more often than the frequencies of the individual words would suggest. Any ideas on how to put that into code?

Solution

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:

>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
... 

>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))

There are no collocations in this small segment, but here goes:

>>> text.collocations(num=20)
Building collocations list
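
To make the idea from the question concrete (ranking bigrams relative to the frequency of the individual words), here is a minimal sketch using only the standard library. It scores each adjacent pair by pointwise mutual information (PMI), which is one common association measure and not necessarily the one NLTK applies in text.collocations(). The function name bigram_pmi, the min_count cutoff, and the sample text are all made up for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI compares how often a pair occurs with how often it would occur if
    the two words were independent, so it takes the frequency of the
    individual words into account. `min_count` is an illustrative cutoff,
    not a value taken from the original answer.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)

    scores = {}
    for (w1, w2), count in pair_counts.items():
        if count < min_count:
            continue  # pairs seen only once get unreliable, inflated scores
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample text, split on whitespace to keep the sketch simple
sample = ("new york is big . i like new york . "
          "new york is far . i like tea . it is big .").split()

for pair, score in bigram_pmi(sample):
    print(pair, round(score, 2))
```

On real text you would feed in the tokens produced by the tokenize() generator above (wrapped in list()) and probably raise min_count, since PMI tends to overrate rare pairs. nltk.collocations.BigramCollocationFinder, mentioned in the answer, offers this kind of scoring with several association measures to choose from.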
