Python Multiprocessing of NLTK word_tokenizer - function never completes


Problem description

I'm performing natural language processing using NLTK on some fairly large datasets and would like to take advantage of all my processor cores. Seems the multiprocessing module is what I'm after, and when I run the following test code I see all cores are being utilized, but the code never completes.

Executing the same task, without multiprocessing, finishes in approximately one minute.

Python 2.7.11 on Debian.

from nltk.tokenize import word_tokenize
import io
import time
import multiprocessing as mp

def open_file(filepath):
    #open and parse file
    file = io.open(filepath, 'rU', encoding='utf-8')
    text = file.read()
    return text

def mp_word_tokenize(text_to_process):
    #word tokenize
    start_time = time.clock()
    pool = mp.Pool(processes=8)
    word_tokens = pool.map(word_tokenize, text_to_process)
    finish_time = time.clock() - start_time
    print 'Finished word_tokenize in [' + str(finish_time) + '] seconds. Generated [' + str(len(word_tokens)) + '] tokens'
    return word_tokens

filepath = "./p40_compiled.txt"
text = open_file(filepath)
tokenized_text = mp_word_tokenize(text)

Recommended answer

DEPRECATED

This answer is outdated. Please see https://stackoverflow.com/a/54032108/610569 instead.

Here's a cheater's way to do multi-threading using sframe:

>>> import sframe
>>> import time
>>> from nltk import word_tokenize
>>> 
>>> import urllib.request
>>> url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'
>>> response = urllib.request.urlopen(url)
>>> data = response.read().decode('utf8')
>>> 
>>> for _ in range(10):
...     start = time.time()
...     for line in data.split('\n'):
...         x = word_tokenize(line)
...     print ('word_tokenize():\t', time.time() - start)
... 
word_tokenize():     4.058445692062378
word_tokenize():     4.05820369720459
word_tokenize():     4.090051174163818
word_tokenize():     4.210559129714966
word_tokenize():     4.17473030090332
word_tokenize():     4.105806589126587
word_tokenize():     4.082665681838989
word_tokenize():     4.13646936416626
word_tokenize():     4.185062408447266
word_tokenize():     4.085020065307617

>>> sf = sframe.SFrame(data.split('\n'))
>>> for _ in range(10):
...     start = time.time()
...     x = list(sf.apply(lambda x: word_tokenize(x['X1'])))
...     print ('word_tokenize() with sframe:\t', time.time() - start)
... 
word_tokenize() with sframe:     7.174573659896851
word_tokenize() with sframe:     5.072867393493652
word_tokenize() with sframe:     5.129574775695801
word_tokenize() with sframe:     5.10952091217041
word_tokenize() with sframe:     5.015898942947388
word_tokenize() with sframe:     5.037845611572266
word_tokenize() with sframe:     5.015375852584839
word_tokenize() with sframe:     5.016635894775391
word_tokenize() with sframe:     5.155989170074463
word_tokenize() with sframe:     5.132697105407715

>>> for _ in range(10):
...     start = time.time()
...     x = [word_tokenize(line) for line in data.split('\n')]
...     print ('str.split():\t', time.time() - start)
... 
str.split():     4.176181793212891
str.split():     4.116339921951294
str.split():     4.1104896068573
str.split():     4.140819549560547
str.split():     4.103625774383545
str.split():     4.125757694244385
str.split():     4.10755729675293
str.split():     4.177418947219849
str.split():     4.11145281791687
str.split():     4.140623092651367

Note that the speed difference might be because I have something else running on the other cores. But given a much larger dataset and dedicated cores, you can really see this scale.
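
As an aside (not part of the original answer), here is a minimal sketch of the same benchmark done with the standard multiprocessing module, assuming Python 3 and the same test.txt download as above. The key point it illustrates is that Pool.map should be given a list of lines: passing one big string, as in the question's code, makes map iterate over it character by character, which is likely why that version never seemed to finish.

import time
import urllib.request
from multiprocessing import Pool

from nltk import word_tokenize

url = 'https://raw.githubusercontent.com/Simdiva/DSL-Task/master/data/DSLCC-v2.0/test/test.txt'

def main():
    # Same test file as the sframe benchmark above.
    data = urllib.request.urlopen(url).read().decode('utf8')
    lines = data.split('\n')

    start = time.time()
    # Map over a list of lines so each worker tokenizes a whole line;
    # mapping over the raw string would hand each worker a single character.
    with Pool() as pool:  # defaults to os.cpu_count() worker processes
        tokens = pool.map(word_tokenize, lines)
    print('word_tokenize() with multiprocessing.Pool:\t', time.time() - start)
    print('tokenized', len(tokens), 'lines')

if __name__ == '__main__':
    # The guard is required so worker processes do not re-run the script body.
    main()

Whether this beats the single-process loop depends on how much work each line represents versus the cost of pickling lines and token lists between processes; Pool.map's chunksize argument can help amortize that overhead on larger inputs.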
