How to get cython and gensim to work with pyspark
Problem description
I'm running a Lubuntu 16.04 machine with gcc installed. I can't get gensim to work with cython: when I train a doc2vec model, it is only ever trained with one worker, which is dreadfully slow.
As I said, gcc was installed from the start. I may have made the mistake of installing gensim before cython. I corrected that by forcing a reinstall of gensim via pip, but with no effect: still just one worker.
The machine is set up as a spark master and I interface with spark via pyspark. It works something like this: pyspark uses jupyter, and jupyter uses Python 3.5. This way I get a jupyter interface to my cluster. Now I have no idea whether this is the reason I can't get gensim to work with cython. I don't execute any gensim code on the cluster; it is just more convenient to fire up jupyter to also do gensim.
Accepted answer
After digging deeper and trying things like loading the whole corpus into memory and executing gensim in a different environment, all with no effect, it seems the problem lies with gensim itself: the code is only partially parallelized. This results in the workers not being able to fully utilize the CPU. See the related issue on GitHub.
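Even with this limitation, it still makes sense to pass a sensible workers count to Doc2Vec. The sketch below shows one way to do that; the corpus file, tags, and helper name are illustrative, and the gensim call is left commented since it requires gensim with its compiled extensions.

```python
import os

def pick_workers(requested=None):
    """Choose a worker count, defaulting to all available cores.

    gensim only benefits from extra workers when its compiled
    (cython) routines are installed; without them, training stays
    effectively single-threaded no matter what is passed here."""
    cores = os.cpu_count() or 1
    return cores if requested is None else min(requested, cores)

# Illustrative only -- requires gensim with compiled extensions,
# and "corpus.txt" is a hypothetical one-document-per-line file:
# from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# docs = [TaggedDocument(words=line.split(), tags=[i])
#         for i, line in enumerate(open("corpus.txt"))]
# model = Doc2Vec(docs, vector_size=100, workers=pick_workers())
```

Because of the partial parallelization described above, CPU utilization may still plateau well below workers × 100% during training.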