Multi-Threaded NLP with Spacy pipe
Question
I'm trying to apply the Spacy NLP (Natural Language Processing) pipeline to a big text file such as a Wikipedia dump. Here is my code, based on the example in Spacy's documentation:
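from spacy.en import English

# Read the whole file into memory (Python 2, spaCy 0.100.x API)
with open("big_file.txt") as f:
    big_text = f.read()

nlp = English()
# Pass the text as a single-item list; n_threads=-1 leaves the thread count to the library
out = nlp.pipe([unicode(big_text, errors='ignore')], n_threads=-1)
doc = next(out)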
Spacy applies all NLP operations, such as POS tagging and lemmatizing, at once. It is like a pipeline for NLP that takes care of everything you need in one step. Using the pipe method, though, is supposed to make the process a lot faster by multithreading the expensive parts of the pipeline. But I don't see a big improvement in speed, and my CPU usage is around 25% (only one of 4 cores is working). I also tried reading the file in multiple chunks and increasing the number of input texts in the batch:
out = nlp.pipe([part1, part2, ..., part4], n_threads=-1)
but the performance is still the same. Is there any way to speed up the process? I suspect that Spacy needs to be compiled with OpenMP enabled in order to use multithreading, but there are no instructions on how to do that on Windows.
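For reference, the chunked reading described above could look roughly like the sketch below. It assumes the input is plain text with paragraphs separated by blank lines; `iter_paragraphs` is a hypothetical helper written for illustration and is not part of Spacy.

```python
import io

def iter_paragraphs(path, encoding="utf-8"):
    """Lazily yield non-empty, blank-line-separated paragraphs from a text file.

    Hypothetical helper: it only splits the input into many smaller texts,
    so the file never has to be passed to the pipeline as one huge string.
    """
    buf = []
    with io.open(path, encoding=encoding, errors="ignore") as f:
        for line in f:
            if line.strip():
                buf.append(line.strip())
            elif buf:
                # A blank line closes the current paragraph
                yield u" ".join(buf)
                buf = []
    if buf:
        # Flush the last paragraph if the file does not end with a blank line
        yield u" ".join(buf)

# Assumed usage with the API from the question (not executed here):
# for doc in nlp.pipe(iter_paragraphs("big_file.txt"), n_threads=-1):
#     print(len(doc))
```

Feeding many small texts instead of four large parts gives the pipeline more independent units of work, although on its own this does not help if the build has multithreading disabled.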
Recommended answer
I figured out what the problem was. Spacy's pipe() method uses OpenMP to implement multithreading, and this option is disabled by default for the MSVC compiler. After I compiled the source code with OpenMP support, it works great. I also made a pull request to enable this in upcoming releases, so for releases after 0.100.7 (the latest version at the time of writing) multithreading with pipe() should work on Windows without any issue.
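To check whether a given build actually benefits from multithreading, one option is to time the same workload with different thread counts. The sketch below is a minimal, assumption-laden helper: `docs_per_second` is a hypothetical name, and `pipe_fn` stands for any callable that takes an iterable of texts and returns an iterable of results.

```python
import time

def docs_per_second(pipe_fn, texts):
    """Run pipe_fn over texts, consume every result, and return (count, docs per second).

    Hypothetical helper for comparing throughput; pipe_fn is any callable
    that maps an iterable of texts to an iterable of processed documents.
    """
    start = time.time()
    # Consuming the iterator forces the pipeline to do its work
    count = sum(1 for _ in pipe_fn(texts))
    elapsed = time.time() - start
    # Guard against a zero elapsed time on very small inputs
    return count, count / max(elapsed, 1e-9)

# Assumed usage with the API from the question (not executed here):
# single = docs_per_second(lambda t: nlp.pipe(t, n_threads=1), texts)
# multi = docs_per_second(lambda t: nlp.pipe(t, n_threads=-1), texts)
```

If both calls report roughly the same throughput and only one core is busy, the build was most likely compiled without OpenMP support.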