Counting bigrams (pair of two words) in a file using python

Problem Description
I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. Since I am dealing with very large files, I am looking for an efficient way. I tried using the count method with the regex "\w+\s\w+" on the file contents, but it did not prove efficient.
e.g. Let's say I want to count the number of bigrams from a file a.txt, which has the following content:
"the quick person did not realize his speed and the quick person bumped"
For the above file, the bigram set and their counts will be:
(the,quick) = 2
(quick,person) = 2
(person,did) = 1
(did, not) = 1
(not, realize) = 1
(realize,his) = 1
(his,speed) = 1
(speed,and) = 1
(and,the) = 1
(person, bumped) = 1
I have come across an example of the Counter object in Python, which is used to count unigrams (single words). It also uses a regex approach.
The example goes like this:
>>> # Find the most common words in a.txt
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common())
The output of the above code is:
[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1),
('realize', 1), ('his', 1), ('speed', 1), ('bumped', 1)]
I was wondering if it is possible to use the Counter object to get counts of bigrams. Any approach other than the Counter object or regex would also be appreciated.
Some itertools magic:
>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+",
...     "the quick person did not realize his speed and the quick person bumped")
>>> print(Counter(zip(words, islice(words, 1, None))))
Output:
Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1,
('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1,
('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1,
('realize', 'his'): 1})
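The pairing works because islice(words, 1, None) is the same word sequence shifted left by one, and zip stops at the shorter of its two inputs; a minimal sketch on a toy list:

```python
from itertools import islice

words = ["the", "quick", "person", "did"]
# Pair each word with its successor: zip truncates at the shorter
# input, so the final word (with no successor) is dropped.
pairs = list(zip(words, islice(words, 1, None)))
print(pairs)  # [('the', 'quick'), ('quick', 'person'), ('person', 'did')]
```

Note that this shifted-zip trick requires words to be a re-iterable sequence such as a list; a one-shot iterator would be consumed by both zip arguments at once.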
Bonus
Get the frequency of any n-gram:
from itertools import tee, islice

def ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break

>>> Counter(ngrams(words, 3))
Output:
Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1,
('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1,
('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1,
('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1,
('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
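The same n-gram stream can also be built by teeing the iterator n ways, advancing the i-th copy by i positions, and zipping the copies back together; a minimal sketch (ngrams_tee is a hypothetical name, not part of the answer above):

```python
from itertools import tee

def ngrams_tee(iterable, n):
    # Make n independent copies of the iterator, advance the
    # i-th copy by i positions, then zip them back together.
    # zip stops when the shortest (most-advanced) copy runs out.
    iters = tee(iterable, n)
    for i, it in enumerate(iters):
        for _ in range(i):
            next(it, None)
    return zip(*iters)
```

Counter(ngrams_tee(words, 3)) gives the same counts as the version above.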
This works with lazy iterables and generators too, so you can write a generator which reads a file line by line, yielding words, and pass it to ngrams to consume lazily without reading the whole file into memory.
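Such a generator could look like the following sketch (the filename a.txt and the helper name words_from_file are placeholders):

```python
import re

def words_from_file(path):
    # Yield words one at a time, reading the file line by line,
    # so the whole file never has to fit in memory.
    with open(path) as f:
        for line in f:
            for word in re.findall(r"\w+", line):
                yield word

# Usage, assuming the ngrams generator defined above:
# bigram_counts = Counter(ngrams(words_from_file("a.txt"), 2))
```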