How to speed up Gensim Word2vec model load time?


Question

I'm building a chatbot so I need to vectorize the user's input using Word2Vec.

I'm using a pre-trained model with 3 million words by Google (GoogleNews-vectors-negative300).

So I load the model using Gensim:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

The problem is that it takes about 2 minutes to load the model. I can't let the user wait that long.

So what can I do to speed up the load time?

I thought about putting each of the 3 million words and their corresponding vector into a MongoDB database. That would certainly speed things up but intuition tells me it's not a good idea.

Answer

In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most- to least- frequent order, so the first N are usually the N-sized subset you'd want. So use limit=500000 to get the most-frequent 500,000 words' vectors – still a fairly large vocabulary – saving 5/6ths of the memory/load-time.)

So that may help a bit. But if you're re-loading for every web-request, you'll still be hurting from loading's IO-bound speed, and the redundant memory overhead of storing each re-load.

There are some tricks you can use in combination to help.

Note that after loading such vectors in their original word2vec.c-originated format, you can re-save them using gensim's native save(). If you save them uncompressed, and the backing array is large enough (and the GoogleNews set is definitely large enough), the backing array gets dumped in a separate file in a raw binary format. That file can later be memory-mapped from disk, using gensim's native load(filename, mmap='r') option.

Initially, this will make the load seem snappy – rather than reading all the array from disk, the OS will just map virtual address regions to disk data, so that some time later, when code accesses those memory locations, the necessary ranges will be read-from-disk. So far so good!

However, if you are doing typical operations like most_similar(), you'll still face big lags, just a little later. That's because this operation requires both an initial scan-and-calculation over all the vectors (on first call, to create unit-length-normalized vectors for every word), and then another scan-and-calculation over all the normed vectors (on every call, to find the N-most-similar vectors). Those full-scan accesses will page-into-RAM the whole array – again costing the couple-of-minutes of disk IO.

What you want is to avoid redundantly doing that unit-normalization, and to pay the IO cost just once. That requires keeping the vectors in memory for re-use by all subsequent web requests (or even multiple parallel web requests). Fortunately, memory-mapping can also help here, albeit with a few extra prep steps.

First, load the word2vec.c-format vectors, with load_word2vec_format(). Then, use model.init_sims(replace=True) to force the unit-normalization, destructively in-place (clobbering the non-normalized vectors).

Then, save the model to a new filename-prefix: model.save('GoogleNews-vectors-gensim-normed.bin'). (Note that this actually creates multiple files on disk that need to be kept together for the model to be re-loaded.)

Now, we'll make a short Python program that serves to both memory-map load the vectors, and force the full array into memory. We also want this program to hang until externally terminated (keeping the mapping alive), and be careful not to re-calculate the already-normed vectors. This requires another trick because the loaded KeyedVectors actually don't know that the vectors are normed. (Usually only the raw vectors are saved, and normed versions re-calculated whenever needed.)

Something like:

from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0  # prevent recalc of normed vectors
model.most_similar('stuff')  # any word will do: just to page all in
Semaphore(0).acquire()  # just hang until process killed

This will still take a while, but only needs to be done once, before/outside any web requests. While the process is alive, the vectors stay mapped into memory. Further, unless/until there's other virtual-memory pressure, the vectors should stay loaded in memory. That's important for what's next.

Finally, in your web request-handling code, you can now just do the following:

from gensim.models import KeyedVectors
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0  # prevent recalc of normed vectors
# … plus whatever else you wanted to do with the model

Multiple processes can share read-only memory-mapped files. (That is, once the OS knows that file X is in RAM at a certain position, every other process that also wants a read-only mapped version of X will be directed to re-use that data, at that position.).

So this web-request load(), and any subsequent accesses, can all re-use the data that the prior process already brought into address-space and active-memory. Operations requiring similarity-calcs against every vector will still take the time to access multiple GB of RAM, and do the calculations/sorting, but will no longer require extra disk-IO and redundant re-normalization.

If the system is facing other memory pressure, ranges of the array may fall out of memory until the next read pages them back in. And if the machine lacks the RAM to ever fully load the vectors, then every scan will require a mix of paging-in-and-out, and performance will be frustratingly bad no matter what. (In such a case: get more RAM or work with a smaller vector set.)
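As a back-of-envelope check on that RAM requirement (3 million words of 300 float32 dimensions each):

```python
# Rough RAM footprint of the full GoogleNews vector array
words, dims, bytes_per_float32 = 3_000_000, 300, 4
gib = words * dims * bytes_per_float32 / 2**30
print(f"{gib:.2f} GiB")  # 3.35 GiB
```

So the machine needs several GB free beyond OS and application overhead, and roughly double that during any non-destructive normalization, since the raw and normed copies briefly coexist.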

But if you do have enough RAM, this winds up making the original/natural load-and-use-directly code "just work" in a quite fast manner, without an extra web service interface, because the machine's shared file-mapped memory functions as the service interface.

