How to get cython and gensim to work with pyspark


Question

I'm running a Lubuntu 16.04 machine with gcc installed. I can't get gensim to work with cython: when I train a doc2vec model, it is only ever trained with one worker, which is dreadfully slow.

As I said, gcc was installed from the start. I may then have made the mistake of installing gensim before cython. I corrected that by forcing a reinstall of gensim via pip, but it had no effect: still just one worker.
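One way to check whether a reinstall actually picked up the compiled routines is to inspect gensim's `FAST_VERSION` flag. The following is a minimal sketch, assuming the installed gensim release exposes `gensim.models.word2vec.FAST_VERSION` (older releases set it to -1 when only the pure-Python fallback is available; newer ones may refuse to import instead). The helper names `describe_fast_version` and `read_fast_version` are hypothetical and not part of gensim.

```python
def describe_fast_version(value):
    """Turn gensim's FAST_VERSION flag into a readable message."""
    if value is None:
        return "gensim (or its compiled extension) could not be imported"
    if value < 0:
        return "pure-Python fallback in use: training will be slow"
    return "compiled (cython) routines in use, FAST_VERSION=%d" % value


def read_fast_version():
    """Return FAST_VERSION, or None if it cannot be determined."""
    try:
        # Assumption: this module exposes FAST_VERSION in the installed release.
        from gensim.models import word2vec
    except Exception:
        # gensim is missing, or its compiled extension failed to load.
        return None
    return getattr(word2vec, "FAST_VERSION", None)


if __name__ == "__main__":
    print(describe_fast_version(read_fast_version()))
```

Running this from the same jupyter kernel that trains the model matters, since a different interpreter may see a different gensim installation.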

The machine is set up as a spark master, and I interface with spark via pyspark. It works something like this: pyspark uses jupyter, and jupyter uses python 3.5. This way I get a jupyter interface to my cluster. I have no idea whether this is the reason why I can't get gensim to work with cython. I don't execute any gensim code on the cluster; it is just more convenient to fire up jupyter to also do gensim.
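Because the notebook is launched through pyspark, it can help to confirm which interpreter the kernel actually runs, and whether it is the one where cython and gensim were installed. The sketch below assumes the setup is driven by Spark's `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables; the question does not say how the setup is configured, so that part is a guess. The function name `interpreter_report` is hypothetical.

```python
import os
import sys


def interpreter_report(environ=None):
    """Collect the interpreter path, version and pyspark-related variables."""
    environ = os.environ if environ is None else environ
    return {
        "executable": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
        # Assumption: the setup uses these Spark environment variables.
        "PYSPARK_PYTHON": environ.get("PYSPARK_PYTHON", "<unset>"),
        "PYSPARK_DRIVER_PYTHON": environ.get("PYSPARK_DRIVER_PYTHON", "<unset>"),
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(key, "=", value)
```

If the reported executable differs from the python that pip installed into, the reinstall of gensim may have gone to another environment.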

Recommended answer

After digging deeper and trying things like loading the whole corpus into memory, executing gensim in a different environment, etc., nothing had any effect. It seems to be a problem with gensim itself: the code is only partially parallelized, so the workers cannot fully utilize the CPU. See the issues on github (link).
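To see why partial parallelization caps the benefit of extra workers, Amdahl's law gives a rough estimate. This is an illustration brought in here, not something the answer or the gensim issue states, and the parallel fraction used below is made up rather than measured.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Estimated speedup when only part of the work runs in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical case: half of the training loop is parallelized.
    for workers in (1, 2, 4, 8):
        print(workers, round(amdahl_speedup(0.5, workers), 2))
```

With half of the work serial, the estimated speedup stays below 2 no matter how many workers are added, which matches the symptom of workers sitting mostly idle.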
