How to remove a word completely from a Word2Vec model in gensim?


Question

Given a model, e.g.

from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [d.lower().split() for d in documents]

w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

It's possible to remove the word from the w2v vocabulary, e.g.

# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433  0.08862179  0.08601206  0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"

But when we do a similarity on other words after deleting graph, we see the word graph popping up, e.g.

>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
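The reason graph resurfaces is that most_similar ranks candidates by scanning the whole vector matrix and mapping row indices back through index2word; deleting the entry from the vocab dict only breaks key lookup, it never touches the matrix. A minimal numpy sketch of that lookup path (toy vectors and names for illustration, not gensim's internals verbatim):

```python
import numpy as np

# Toy stand-ins for the KeyedVectors internals.
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # one row per word
index2word = ["graph", "binary", "trees"]
vocab = {w: i for i, w in enumerate(index2word)}          # used for key lookup only

def most_similar(word, topn=2):
    # Rank by cosine similarity over ALL rows of the matrix.
    q = vectors[vocab[word]]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [(index2word[i], float(sims[i])) for i in order if index2word[i] != word][:topn]

del vocab["graph"]             # analogous to `del w2v_model.wv.vocab['graph']`
print(most_similar("binary"))  # 'graph' still appears: the matrix is untouched
```

So removing the word for real means rebuilding the matrix and the index maps, not just the dict.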

How to remove a word completely from a Word2Vec model in gensim?

To answer @vumaasha's comment:

could you give some details as to why you want to delete a word

  • Let's say my universe of words is all the words in the corpus, used to learn the dense relations between all of them.

    But when I want to generate similar words, they should come only from a subset of domain-specific words.

    It's possible to generate more than enough results from .most_similar() and then filter them, but if the domain-specific space is small, the word I'm looking for might be ranked around the 1000th most similar, which is inefficient.

    It would be better if the words were removed from the word vectors entirely, so that .most_similar() never returns words outside the specific domain.
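The filter-after-the-fact workaround mentioned above can be sketched in plain Python (the ranked list and domain set below are hypothetical; with gensim you would pass a large topn to .most_similar() and filter its output):

```python
def filter_to_domain(ranked, domain_words, topn=3):
    """Keep only (word, score) pairs whose word is in the domain set."""
    return [(w, s) for w, s in ranked if w in domain_words][:topn]

# Hypothetical ranked output, as .most_similar(..., topn=large_n) would return it.
ranked = [("binary", 0.68), ("a", 0.63), ("unordered", 0.60), ("trees", 0.23)]
domain = {"unordered", "trees"}
print(filter_to_domain(ranked, domain))  # [('unordered', 0.6), ('trees', 0.23)]
```

The inefficiency is exactly what the question describes: topn has to be large enough to catch deeply-ranked domain words, while most of the scanned candidates are discarded.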

      Answer

      I wrote a function which removes words from KeyedVectors which aren't in a predefined word list.

      import numpy as np

      def restrict_w2v(w2v, restricted_word_set):
          # Rebuild every word-related attribute of the KeyedVectors,
          # keeping only the words in restricted_word_set.
          new_vectors = []
          new_vocab = {}
          new_index2entity = []
          new_vectors_norm = []

          w2v.init_sims()  # ensure vectors_norm is populated before reading it
          for i in range(len(w2v.vocab)):
              word = w2v.index2entity[i]
              vec = w2v.vectors[i]
              vocab = w2v.vocab[word]
              vec_norm = w2v.vectors_norm[i]
              if word in restricted_word_set:
                  vocab.index = len(new_index2entity)
                  new_index2entity.append(word)
                  new_vocab[word] = vocab
                  new_vectors.append(vec)
                  new_vectors_norm.append(vec_norm)

          w2v.vocab = new_vocab
          w2v.vectors = np.array(new_vectors)
          w2v.index2entity = new_index2entity
          w2v.index2word = new_index2entity
          w2v.vectors_norm = np.array(new_vectors_norm)


      It rewrites all of the word-related variables based on Word2VecKeyedVectors.

      Usage:

      w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
      w2v.most_similar("beer")
      

      [('beers', 0.8409687876701355),
      ('lager', 0.7733745574951172),
      ('Beer', 0.71753990650177),
      ('drinks', 0.668931245803833),
      ('lagers', 0.6570086479187012),
      ('Yuengling_Lager', 0.655455470085144),
      ('microbrew', 0.6534324884414673),
      ('Brooklyn_Lager', 0.6501551866531372),
      ('suds', 0.6497018337249756),
      ('brewed_beer', 0.6490240097045898)]

      restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
      restrict_w2v(w2v, restricted_word_set)
      w2v.most_similar("beer")
      

      [('lagers', 0.6570085287094116),
      ('wine', 0.6217695474624634),
      ('bash', 0.20583480596542358),
      ('computer', 0.06677375733852386),
      ('python', 0.005948573350906372)]
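The effect of restrict_w2v can also be demonstrated without gensim: rebuild the vector matrix and both index maps for the allowed subset, and similarity queries can then only ever return subset words. A self-contained numpy sketch (toy vectors and words, not the GoogleNews ones):

```python
import numpy as np

def restrict(vectors, index2word, allowed):
    """Rebuild the matrix and index maps to contain only the allowed words."""
    keep = [i for i, w in enumerate(index2word) if w in allowed]
    new_vectors = vectors[keep]                         # rows for kept words only
    new_index2word = [index2word[i] for i in keep]
    new_vocab = {w: i for i, w in enumerate(new_index2word)}  # reindexed from 0
    return new_vectors, new_index2word, new_vocab

vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
index2word = ["beer", "lagers", "python", "bash"]
vecs, i2w, vocab = restrict(vectors, index2word, {"beer", "lagers"})
print(i2w)  # ['beer', 'lagers'] -- 'python' and 'bash' are gone for good
```

This mirrors what restrict_w2v does to vocab, vectors, index2entity/index2word, and vectors_norm: after the rebuild there is no row left in the matrix for an excluded word, so no ranking can resurface it.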

