What are the specific steps for computing sentence vectors from word2vec word vectors using the averaging method?

Problem description

Beginner question, but I am a bit puzzled by this. Hope the answer to this question can benefit other beginners in NLP as well.

Here are more details:

I know that you can compute sentence vectors from word vectors generated by word2vec. But what are the actual steps involved in making these sentence vectors? Can anyone provide an intuitive example and then some calculations to explain this process?

E.g.: Suppose I have a sentence with three words: "Today is hot." And suppose these words have hypothetical vector values of (1,2,3), (4,5,6), (7,8,9). Do I get the sentence vector by performing component-wise averaging of these word vectors? And what if the vectors are of different lengths, e.g. (1,2), (4,5,6), (7,8,9,23,76)? What does the averaging process look like in those cases?

Answer

Creating the vector for a length-of-text (sentence/paragraph/document) by averaging the word-vectors is one simple approach. (It's not great at capturing shades-of-meaning, but it's easy to do.)
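Using the hypothetical 3-dimensional vectors from the question, component-wise averaging works like this (a minimal NumPy sketch; the numbers are the asker's made-up values, not real word2vec output):

import numpy as np

# Hypothetical vectors for "Today", "is", "hot" from the question.
word_vectors = np.array([[1, 2, 3],
                         [4, 5, 6],
                         [7, 8, 9]])

# The element-wise mean over the three words is the sentence vector.
sentence_vector = word_vectors.mean(axis=0)
print(sentence_vector)  # [4. 5. 6.]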

Using the gensim library, it can be as simple as:

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

# Load pre-trained word2vec vectors (here, the 300-dimensional Google News vectors).
wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
text = "the quick brown fox jumped over the lazy dog"
# Look up each word's vector and take the element-wise mean to get one text vector.
text_vector = np.mean([wv[word] for word in text.split()], axis=0)
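One practical caveat (an addition, not part of the original answer): wv[word] raises a KeyError for words missing from the vocabulary, so you may want to filter tokens first. The membership check below assumes a gensim version where `word in wv` tests vocabulary membership:

tokens = [word for word in text.split() if word in wv]  # drop out-of-vocabulary words
if tokens:
    text_vector = np.mean([wv[word] for word in tokens], axis=0)
else:
    text_vector = np.zeros(wv.vector_size)  # fallback when no token is in the vocabulary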

Alternatives to consider are whether to use the raw word-vectors, word-vectors that have been unit-normalized, or word-vectors weighted by some measure of word significance.
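As a sketch of the unit-normalized variant (reusing wv and text from the snippet above; this is one illustration of the alternatives mentioned, not the answerer's code):

vectors = np.array([wv[word] for word in text.split() if word in wv])
# Divide each word vector by its L2 norm, then average the resulting unit vectors.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
text_vector = (vectors / norms).mean(axis=0)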

Word-vectors that are compatible with each other will have the same number of dimensions, so there's never an issue of trying to average differently-sized vectors.

Other techniques like 'Paragraph Vectors' (Doc2Vec in gensim) might give better text-vectors for some purposes, on some corpuses.
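A minimal Doc2Vec sketch with a hypothetical two-document toy corpus (real use needs far more data; the parameter values are illustrative only):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical toy corpus; each document carries a tag.
corpus = [
    TaggedDocument(words="the quick brown fox jumped over the lazy dog".split(), tags=[0]),
    TaggedDocument(words="the dog slept in the sun all day".split(), tags=[1]),
]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a vector for a new, unseen text.
new_vector = model.infer_vector("a quick fox ran past the dog".split())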

Other techniques for comparing the similarity of texts that leverage word-vectors, like "Word Mover's Distance" (WMD), might give better pairwise text-similarity scores than comparing single summary vectors. (WMD doesn't reduce a text to a single vector, and can be expensive to calculate.)
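A minimal WMD sketch using gensim's wmdistance on the KeyedVectors loaded earlier (it relies on an optional earth-mover's-distance dependency whose name varies by gensim version, so treat this as an assumption about your install):

# Lower distance means the two tokenized texts are more similar.
sent_a = "the quick brown fox jumped over the lazy dog".split()
sent_b = "a fast dark fox leaped over a sleepy dog".split()
distance = wv.wmdistance(sent_a, sent_b)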
