Counting bigrams (pair of two words) in a file using python


Problem Description


I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. Here I am dealing with very large files, so I am looking for an efficient way. I tried using the count method with the regex "\w+\s\w+" on the file contents, but it did not prove to be efficient.

e.g. Let's say I want to count the number of bigrams from a file a.txt, which has the following content:

"the quick person did not realize his speed and the quick person bumped "

For the above file, the bigram set and their counts will be:

(the, quick) = 2
(quick, person) = 2
(person, did) = 1
(did, not) = 1
(not, realize) = 1
(realize, his) = 1
(his, speed) = 1
(speed, and) = 1
(and, the) = 1
(person, bumped) = 1
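
Incidentally, the "\w+\s\w+" idea falls short on correctness as well as speed: re.findall returns non-overlapping matches, so each word is consumed by at most one match and every other bigram is skipped. A quick demonstration on the sample text:

>>> import re
>>> text = "the quick person did not realize his speed and the quick person bumped"
>>> re.findall(r"\w+\s\w+", text)
['the quick', 'person did', 'not realize', 'his speed', 'and the', 'quick person']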

I have come across an example of the Counter object in Python, which is used to count unigrams (single words). It also uses a regex approach.

The example goes like this:

>>> # Count the word frequencies in a.txt
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words))

The output of the above code is:

Counter({'the': 2, 'quick': 2, 'person': 2, 'did': 1, 'not': 1,
 'realize': 1, 'his': 1, 'speed': 1, 'and': 1, 'bumped': 1})

I was wondering if it is possible to use the Counter object to get counts of bigrams. Any approach other than the Counter object or regex will also be appreciated.

Solution

Some itertools magic:

>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+",
...     "the quick person did not realize his speed and the quick person bumped")
>>> print(Counter(zip(words, islice(words, 1, None))))

Output:

Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, 
  ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, 
  ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, 
  ('realize', 'his'): 1})
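
Since the result is an ordinary Counter, the usual Counter methods apply; for instance, most_common picks out the top pairs (continuing the session above, with a throwaway name bigram_counts):

>>> bigram_counts = Counter(zip(words, islice(words, 1, None)))
>>> bigram_counts.most_common(2)
[(('the', 'quick'), 2), (('quick', 'person'), 2)]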

Bonus

Get the frequency of any n-gram:

from itertools import tee, islice

def ngrams(lst, n):
    # Works on any iterable, including generators: tee() splits the
    # stream into two independent copies at each step.
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            # A full window of n words: emit it, then advance the
            # stream by one word and continue from there.
            yield l
            next(b)
            tlst = b
        else:
            # Fewer than n words remain, so we are done.
            break

>>> Counter(ngrams(words, 3))

Output:

Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, 
  ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, 
  ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, 
  ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, 
  ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})

This works with lazy iterables and generators too. So you can write a generator which reads a file line by line, yields words, and pass it to ngrams to consume lazily, without reading the whole file into memory.
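
A minimal sketch of that idea (the helper name words_from_file and the use of a.txt are placeholders, not part of the original answer):

import re
from collections import Counter

def words_from_file(path):
    # Hypothetical helper: lazily yield one word at a time,
    # reading the file line by line rather than all at once.
    with open(path) as f:
        for line in f:
            for word in re.findall(r"\w+", line):
                yield word

# The lazy word stream feeds straight into ngrams() from above;
# only a small window of words is ever held in memory.
print(Counter(ngrams(words_from_file("a.txt"), 2)))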
