Why are word embeddings actually vectors?


Question

I am sorry for my naivety, but I don't understand why the word embeddings that result from the NN training process (word2vec) are actually vectors.

Embedding is a process of dimension reduction: during training the NN reduces the 1/0 arrays of words into smaller arrays, and nothing in that process applies vector arithmetic.

So as a result we get just arrays, not vectors. Why should I think of these arrays as vectors?

Even if we do get vectors, why does everyone depict them as vectors coming from the origin (0,0)?

Again, I am sorry if my question looks stupid.

Answer

What is an embedding?

Word embedding is the collective name for a set of language modelling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.

(Source: https://en.wikipedia.org/wiki/Word_embedding)

What is Word2Vec?

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words.

Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

(Source: https://en.wikipedia.org/wiki/Word2vec)

What is an array?

In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.

An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.

The simplest type of data structure is a linear array, also called a one-dimensional array.
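
For example, here is a minimal Python sketch of that index-to-position formula for a two-dimensional array stored row by row inside one linear array (the sizes, values and the helper name element are purely illustrative):

>>> nrows, ncols = 3, 4
>>> flat = list(range(nrows * ncols))    # a 3x4 array flattened into one linear array
>>> def element(row, col):
...     return flat[row * ncols + col]   # position computed from the index tuple (row, col)
...
>>> element(2, 1)
9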

What is a vector / vector space?

A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars.

Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field.

The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms.

(Source: https://en.wikipedia.org/wiki/Vector_space)

What is the difference between a vector and an array?

Firstly, the vector in word embeddings is not exactly the programming-language data structure (so it's not "Arrays vs Vectors: Introductory Similarities and Differences").

Programmatically, a word embedding vector IS some sort of array (data structure) of real numbers (i.e. scalars).

Mathematically, any element with one or more dimensions populated with real numbers is a tensor, and a vector is a single dimension of scalars.
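
To make that concrete, here is a minimal sketch with made-up toy values (not real embeddings) showing that such a 1-D array of real numbers supports exactly the vector-space operations quoted above, namely addition and scalar multiplication:

>>> import numpy as np
>>> v_king  = np.array([0.5, -1.25, 3.0])   # a toy 3-dimensional "embedding"
>>> v_queen = np.array([0.25, -1.0, 3.5])
>>> v_king + v_queen                        # vector addition       -> [0.75, -2.25, 6.5]
>>> 2.0 * v_king                            # scalar multiplication -> [1.0, -2.5, 6.0]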

To answer the OP's question:

Why are word embeddings actually vectors?

By definition, word embeddings are vectors (see above).

Why do we represent words as vectors of real numbers?

To learn the differences between words, we have to quantify the difference in some manner.

Imagine, if we assign these "smart" numbers to the words:

>>> semnum = semantic_numbers = {'car': 5, 'vehicle': 2, 'apple': 232, 'orange': 300, 'fruit': 211, 'samsung': 1080, 'iphone': 1200}
>>> abs(semnum['fruit'] - semnum['apple'])
21
>>> abs(semnum['samsung'] - semnum['apple'])
848

We see that the distance between fruit and apple is small but between samsung and apple it isn't. In this case, a single numerical "feature" per word is capable of capturing some information about the word meanings, but not fully.

Imagine that we have two real-number values for each word (i.e. a vector):

>>> import numpy as np
>>> semnum = semantic_numbers = {'car': [5, -20], 'vehicle': [2, -18], 'apple': [232, 1010], 'orange': [300, 250], 'fruit': [211, 250], 'samsung': [1080, 1002], 'iphone': [1200, 1100]}

To compute the difference, we could do:

>>> np.array(semnum['apple']) - np.array(semnum['orange'])
array([-68, 760])

>>> np.array(semnum['apple']) - np.array(semnum['samsung'])
array([-848,    8])

That's not very informative: it returns a vector, and we can't get a definitive measure of the distance between the words. So we can try some vectorial tricks and compute the distance between the vectors, e.g. the Euclidean distance:

>>> import numpy as np
>>> orange = np.array(semnum['orange'])
>>> apple = np.array(semnum['apple'])
>>> samsung = np.array(semnum['samsung'])

>>> np.linalg.norm(apple-orange)
763.03604108849277

>>> np.linalg.norm(apple-samsung)
848.03773500947466

>>> np.linalg.norm(orange-samsung)
1083.4685043876448
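
As a side note, another common vectorial trick for word vectors is cosine similarity, which compares the directions of the vectors rather than their lengths; a minimal sketch reusing the toy vectors above (the resulting numbers are approximate):

>>> def cosine_similarity(u, v):
...     return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
...
>>> cosine_similarity(apple, samsung)   # ≈ 0.83
>>> cosine_similarity(apple, orange)    # ≈ 0.80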

Now we can see more "information": apple can be closer to samsung than orange is to samsung. Possibly that's because apple co-occurs in the corpus more frequently with samsung than orange does.

The big question then is: "How do we get these real numbers to represent the vectors of the words?". That's where the Word2Vec / embedding training algorithms (originally conceived by Bengio 2003) come in.
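
For example, here is a minimal sketch of training such vectors with the gensim library (the toy corpus and parameter values are purely illustrative; note that the vector_size argument is called size in gensim versions before 4.0):

>>> from gensim.models import Word2Vec
>>> sentences = [['apple', 'fruit', 'orange'],
...              ['samsung', 'iphone', 'phone'],
...              ['apple', 'iphone', 'samsung']]
>>> model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, epochs=100)
>>> model.wv['apple']                        # the learned embedding: a 1-D array of real numbers
>>> model.wv.similarity('apple', 'samsung')  # cosine similarity between the learned vectors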

Since adding more real numbers to the vector representing a word is more informative, why don't we just add a lot more dimensions (i.e. more columns in each word vector)?

Traditionally, we computed the differences between words by building word-by-word matrices in the field of distributional semantics/distributed lexical semantics, but the matrices become really sparse, with many zero values, when words don't co-occur with one another.

Thus a lot of effort has been put into dimensionality reduction after computing the word co-occurrence matrix (see https://stackoverflow.com/questions/24073030/what-are-co-occurance-matrixes-and-how-are-they-used-in-nlp). IMHO, it's like taking a top-down view of the global relations between words and then compressing the matrix to get a smaller vector to represent each word.
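
Here is a minimal sketch of that count-then-compress approach, using a made-up co-occurrence matrix and a plain SVD from numpy (the counts and the choice of keeping 2 dimensions are purely illustrative):

>>> import numpy as np
>>> vocab = ['apple', 'orange', 'fruit', 'samsung', 'iphone']
>>> cooc = np.array([[0, 3, 4, 2, 2],    # made-up word-by-word co-occurrence counts,
...                  [3, 0, 5, 0, 0],    # rows/columns ordered as in vocab
...                  [4, 5, 0, 0, 0],
...                  [2, 0, 0, 0, 6],
...                  [2, 0, 0, 6, 0]], dtype=float)
>>> U, S, Vt = np.linalg.svd(cooc)
>>> word_vectors = U[:, :2] * S[:2]      # keep only the 2 strongest dimensions
>>> word_vectors.shape                   # one dense 2-dimensional vector per word
(5, 2)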

So "deep learning" word-embedding creation comes from another school of thought: it starts with a layer of randomly (sometimes not-so-randomly) initialized vectors for each word, then learns the parameters/weights of these vectors by optimizing them to minimize some loss function based on some defined properties.

It sounds a little vague, but if we look concretely at the Word2Vec learning techniques, it becomes clearer; see:

  • https://rare-technologies.com/making-sense-of-word2vec/
  • http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
  • https://arxiv.org/pdf/1402.3722.pdf (more mathematical)

Here are more resources to read up on word embeddings: https://github.com/keon/awesome-nlp#word-vectors
