How to train Word2vec on very large datasets?


Problem description

I am thinking of training word2vec on a huge, large-scale dataset: a web crawl dump of more than 10 TB in size.

I personally trained the C implementation on the GoogleNews-2012 dump (1.5 GB) on my iMac; it took about 3 hours to train and generate the vectors (I was impressed with the speed). I did not try the Python implementation, though :( I read somewhere that generating vectors of length 300 on a wiki dump (11 GB) takes about 9 days.


  1. How can I speed up word2vec? Do I need to use distributed models, or what type of hardware do I need to do it within 2-3 days? I have an iMac with 8 GB of RAM.

Which one is faster: the Gensim Python implementation or the C implementation?

I see that the word2vec implementation does not support GPU training.

Recommended answer

There are a number of options for creating Word2Vec models at scale. As you pointed out, the candidate solutions are distributed (and/or multi-threaded) training or GPUs. This is not an exhaustive list, but hopefully it gives you some ideas on how to proceed.

Distributed/multi-threaded options:


  • Gensim uses Cython where it matters, and is equal to, or not much slower than, the C implementation. Gensim's multi-threading works well, and using a machine with ample memory and a large number of cores significantly decreases vector generation time (see the sketch after this list). You may want to investigate Amazon EC2 16- or 32-core instances.
  • Deepdist can utilize gensim and Spark to distribute gensim workloads across a cluster. Deepdist also has some clever SGD optimizations that synchronize the gradient across nodes. If you use multi-core machines as nodes, you can take advantage of both clustering and multi-threading.
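
A minimal sketch of multi-threaded gensim training, assuming a pre-tokenized corpus stored one sentence per line (the file name, vector size, and the query word below are placeholders) and the gensim 4.x API:

```python
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams sentences from disk, so the corpus never has to fit in RAM.
sentences = LineSentence("corpus.txt")        # hypothetical pre-tokenized file

model = Word2Vec(
    sentences,
    vector_size=300,                          # dimensionality of the word vectors
    window=5,
    min_count=5,
    workers=multiprocessing.cpu_count(),      # Cython inner loop releases the GIL, so extra cores help
)

model.save("word2vec.model")
print(model.wv.most_similar("king", topn=5))  # sanity-check the learned vectors
```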

A number of Word2Vec GPU implementations exist. Given the large dataset size and limited GPU memory, you may have to consider a clustering strategy.


  • Bidmach is apparently very fast (documentation is, however, lacking, and admittedly I've struggled to get it working).
  • DL4J has a Word2Vec implementation, but the team has yet to implement cuBLAS gemm, and it is relatively slow compared to CPUs.
  • Keras is a Python deep learning framework that utilizes Theano. While it does not implement word2vec per se, it does implement an embedding layer that can be used to create and query word vectors (a small sketch follows this list).
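
For illustration, a minimal sketch of the embedding-layer approach using the modern tf.keras API (rather than the Theano-backed Keras referred to above); the vocabulary size, dimensionality, and random (target, context) pairs are stand-ins for real skip-gram pairs built from your corpus:

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 10000, 300           # placeholder sizes

# Tiny model: look up the target word's embedding, predict a context word.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random (target, context) index pairs stand in for real skip-gram pairs.
targets = np.random.randint(0, vocab_size, size=(1024, 1))
contexts = np.random.randint(0, vocab_size, size=(1024,))
model.fit(targets, contexts, epochs=1, verbose=0)

# The learned word vectors are the Embedding layer's weight matrix.
word_vectors = model.layers[0].get_weights()[0]   # shape (vocab_size, embed_dim)
print(word_vectors[42][:5])                       # vector for word id 42
```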

There are a number of other CUDA implementations of Word2Vec, at varying degrees of maturity and support:

  • https://github.com/whatupbiatch/cuda-word2vec [memory mgmt looks great, though non-existent documentation on how to create datasets]
  • https://github.com/fengChenHPC/word2vec_cbow [super-fast, but GPU memory issues on large datasets]

I believe the SparkML team has recently gotten a prototype cuBLAS-based Word2Vec implementation going. You may want to investigate this.
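
As a reference point while that prototype matures, Spark MLlib already ships a standard distributed (CPU-based) Word2Vec estimator; a minimal PySpark sketch, assuming a whitespace-tokenized corpus at a hypothetical path:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("word2vec-at-scale").getOrCreate()

# One whitespace-tokenized sentence per line; split each line into a token array.
df = spark.read.text("corpus.txt").selectExpr("split(value, ' ') AS words")

w2v = Word2Vec(vectorSize=300, minCount=5, numPartitions=64,
               inputCol="words", outputCol="features")
model = w2v.fit(df)

model.findSynonyms("king", 5).show()   # nearest neighbours of a word
model.getVectors().show(5)             # (word, vector) lookup table
spark.stop()
```

The numPartitions parameter controls how training is spread across executors, which is what makes very large corpora tractable on a cluster.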

