Counting bigrams real fast (with or without multiprocessing) - python


Problem Description

Given the big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (imagine that I have to repeat this counting 100,000 times).

According to Fast/Optimize N-gram implementations in python, extracting bigrams like this would be the most efficient:

_bigrams = zip(*[text[i:] for i in range(2)])

And if I'm using Python 3, the generator won't be evaluated until I materialize it with list(_bigrams) or some other function that does the same.
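For instance, a minimal illustration of that lazy behaviour on a toy string (the string itself is just for demonstration):

text = 'hello'
_bigrams = zip(*[text[i:] for i in range(2)])
print(_bigrams)        # <zip object ...>, nothing has been evaluated yet
print(list(_bigrams))  # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]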

import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    # map spaces to a private-use placeholder character
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(_bigrams).most_common(100)
    print('time:', time.time() - start)  # roughly 1+ second per iteration
    # Do some manipulation to text and repeat the counting.
    text = manipulate(text, top100)

But that takes around 1+ second per iteration, and 100,000 iterations would take far too long.

I've also tried sklearn's CountVectorizer, but the time to extract, count, and get the top-100 bigrams is comparable to the native Python approach.
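For reference, a character-bigram CountVectorizer set-up looks roughly like the sketch below; the exact parameters are an assumption, not necessarily the ones used in the attempt above:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer='char', ngram_range=(2, 2), lowercase=False)
X = vec.fit_transform([text])            # 1 x n_features sparse matrix of bigram counts
counts = X.toarray().ravel()
features = vec.get_feature_names_out()   # get_feature_names() on older sklearn versions
top100 = sorted(zip(features, counts), key=lambda x: -x[1])[:100]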

Then I experimented with multiprocessing, using a slight modification of the code from Python multiprocessing and a shared counter and from http://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing:

from multiprocessing import Process, Manager, Lock

class MultiProcCounter(object):
    def __init__(self):
        # shared dict living in a Manager server process
        self.dictionary = Manager().dict()
        self.lock = Lock()

    def increment(self, item):
        # every increment takes the lock and round-trips to the Manager process
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    counter = MultiProcCounter()
    # one worker process per input item
    procs = [Process(target=func, args=(counter, _in)) for _in in inputs]
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

inputs = [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 2, 2, 3, 1, 2]

print(multiproc_count(inputs))

But using the MultiProcCounter in the bigram counting takes even longer than 1+ second per iteration. I have no idea why that is the case; with the dummy list-of-ints example, multiproc_count works perfectly.

I tried:

import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(multiproc_count(_bigrams)).most_common(100)

Is there any way to count bigrams really fast in Python?

Recommended Answer

import os
import _thread  # the low-level `thread` module is named `_thread` in Python 3

text = 'I really like cheese'  # just load whatever you want here, this is just an example

CORE_NUMBER = os.cpu_count()  # may return None on some platforms; replace with your core count if it does

ready = []
bigrams = []

def extract_bigrams(cores):
    global ready, bigrams
    bigrams = [0] * cores   # one result slot per worker thread
    ready = [0] * cores     # worker a sets ready[a] = 1 when it is done
    cpnt = 0                # current start point in the text
    iterator = len(text) // cores
    for a in range(cores - 1):
        # the +1 overlap is intentional so the bigram across the chunk boundary is not lost
        _thread.start_new_thread(extract_bigrams2, (cpnt, cpnt + iterator + 1, a))
        cpnt += iterator
    _thread.start_new_thread(extract_bigrams2, (cpnt, len(text), cores - 1))
    while 0 in ready:   # busy-wait until every worker has finished
        pass

def extract_bigrams2(startpoint, endpoint, threadnum):
    global ready, bigrams
    bigrams[threadnum] = list(zip(*[text[startpoint + i:endpoint] for i in range(2)]))
    ready[threadnum] = 1

extract_bigrams(CORE_NUMBER)
thebigrams = []
for a in bigrams:
    thebigrams += a

print(thebigrams)

There are some issues with this program, such as not filtering out whitespace or punctuation, but I made this program to show what you should be shooting for. You can easily edit it to suit your needs.
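For example, a minimal way to drop punctuation before extracting bigrams (assuming you want it removed entirely rather than treated as a boundary) could be:

import string

text = ''.join(ch for ch in text if ch not in string.punctuation)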

This program auto-detects how many cores your computer has and creates that number of threads, attempting to evenly distribute the areas where it looks for bigrams. I've only been able to test this code in a browser-based online interpreter on a school-owned computer, so I can't be certain it works completely. If there are any problems or questions, please leave them in the comments.
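To get the top-100 counts that the question asks for, the merged list can then be fed straight into a Counter, for example:

from collections import Counter

top100 = Counter(thebigrams).most_common(100)
print(top100)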
