After training word embedding with gensim's fasttext's wrapper, how to embed new sentences?


Question


After reading the tutorial in gensim's docs, I do not understand the correct way to generate new embeddings from a trained model. So far I have trained gensim's FastText embeddings like this:

from gensim.models.fasttext import FastText as FT_gensim

model_gensim = FT_gensim(size=100)

# build the vocabulary
model_gensim.build_vocab(corpus_file=corpus_file)

# train the model
model_gensim.train(
    corpus_file=corpus_file, epochs=model_gensim.epochs,
    total_examples=model_gensim.corpus_count, total_words=model_gensim.corpus_total_words
)
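As a note on the input: gensim's corpus_file parameter expects a plain-text file in LineSentence format, i.e. one sentence per line, with tokens separated by whitespace. A minimal sketch of preparing such a file (the contents here are hypothetical):

# hypothetical toy corpus in LineSentence format:
# one whitespace-tokenized sentence per line
with open(corpus_file, 'w') as f:
    f.write('obama speaks to the media in illinois\n')
    f.write('the president greets the press in chicago\n')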


Then, let's say I want to get the embedding vectors associated with these sentences:

sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()


How can I get them with the model_gensim that I trained previously?

Answer


You can look up each word's vector in turn:

wordvecs_obama = [model_gensim[word] for word in sentence_obama]


For your 7-word input sentence, you'll then have a list of 7 word-vectors in wordvecs_obama.
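Note that, unlike plain Word2Vec, a FastText model can usually synthesize a vector even for a word that never appeared in training, by composing the vectors of its character n-grams. A minimal sketch (the misspelled word below is hypothetical, chosen to be out-of-vocabulary):

# FastText builds a vector from learned character n-grams, so a lookup
# for an unseen word generally still returns a vector (assuming some of
# its n-grams were seen during training):
oov_vector = model_gensim['illinios']  # hypothetical misspelling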


No FastText model, as a matter of its inherent functionality, converts longer texts into single vectors. (And in particular, the model you've trained has no default way of doing that.)


There is a "classification mode" in the original Facebook FastText code that involves a different style of training, where texts are associated with known labels at training time, and all the word-vectors of the sentence are combined, both during training and when the model is later asked to classify new texts. But the gensim implementation of FastText does not currently support this mode, as gensim's goal has been to supply unsupervised rather than supervised algorithms.
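For reference, that supervised mode lives in Facebook's fasttext package rather than in gensim. A minimal sketch of what it looks like (the training file and label below are hypothetical):

import fasttext  # Facebook's package, not gensim

# train.txt is a hypothetical file in fasttext's supervised format,
# one example per line, with labels prefixed by __label__, e.g.:
#   __label__politics obama speaks to the media in illinois
model = fasttext.train_supervised(input='train.txt')

# classify a new text; returns (labels, probabilities)
labels, probs = model.predict('the president greets the press in chicago')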


You could approximate what that FastText mode does by averaging together those word-vectors:

import numpy as np

# average the per-word vectors into a single fixed-size sentence vector
meanvec_obama = np.array(wordvecs_obama).mean(axis=0)


Depending on your ultimate purposes, something like that might still be useful. (But, that average wouldn't be as useful for classification as if the word-vectors had originally been trained for that goal, with known labels, in that FastText mode.)
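If, for instance, your purpose is comparing the two sentences, a minimal sketch using cosine similarity between the averaged vectors (this comparison is an illustration, not part of the code above):

# build the averaged vector for the second sentence the same way
wordvecs_president = [model_gensim[word] for word in sentence_president]
meanvec_president = np.array(wordvecs_president).mean(axis=0)

# cosine similarity between the two averaged sentence vectors
similarity = np.dot(meanvec_obama, meanvec_president) / (
    np.linalg.norm(meanvec_obama) * np.linalg.norm(meanvec_president)
)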
