Shared memory among processes for pre-trained word2vec model?


Question

I have a look-up object, specifically a pre-trained word2vec model from gensim.models.keyedvectors.Word2VecKeyedVectors. I need to do some data pre-processing, and I am using multiprocessing for it. Is there a way for all of my processes to use the object from the same memory location, instead of each process loading the object into its own memory?

Recommended answer

Yes, if:

  • the files were saved using Gensim's internal .save() method, and the relevant large arrays of vectors are clearly separate .npy files
  • the files are loaded using Gensim's internal .load() method, with the mmap option
  • you avoid any operation that inadvertently causes each process's object to reallocate the backing array completely (which breaks the mmap sharing).

See this prior answer for an overview of the steps/concerns of a similar need.

(The concern and extra steps listed there to avoid breaking the mmap sharing, namely manual patch-ups of the norm properties, should no longer be necessary in Gensim 4.0.0, which was available only as a prerelease when this answer was written.)
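To illustrate why a full-size normalized copy breaks the sharing, here is a small NumPy-only sketch. It does not use Gensim's internals; it only assumes that the memory-mapped vectors behave like any np.memmap array. The function name demo_mmap_vs_copy and the array sizes are made up for the illustration.

```python
import numpy as np


def demo_mmap_vs_copy(path):
    """Contrast a read-only mapped array with a derived full-size copy."""
    vectors = np.random.default_rng(0).random((1000, 50), dtype=np.float32)
    np.save(path, vectors)  # path is expected to end in .npy

    # File-backed and read-only: other processes can map the same file.
    shared = np.load(path, mmap_mode="r")

    # Basic indexing returns a view onto the same mapping.
    row = shared[3]

    # Unit-normalizing every row allocates a new array of the same size
    # in this process's private memory, so nothing is shared any more.
    norms = np.linalg.norm(shared, axis=1, keepdims=True)
    normalized = shared / norms
    return shared, row, normalized
```

Each process that computes such a normalized array pays for its own full copy, which is the kind of reallocation the third condition in the answer warns about.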
