Understanding input and labels in word2vec (TensorFlow)


Problem Description

I am trying to properly understand the batch_input and batch_labels from the tensorflow "Vector Representations of Words" tutorial.

For example, my data:

 1 1 1 1 1 1 1 1 5 251 371 371 1685 ...

...starting with:

skip_window = 2 # How many words to consider left and right.
num_skips = 1 # How many times to reuse an input to generate a label.

Then the generated input array is:

batch_input = 1 1 1 1 1 1 5 251 371 ....  

This makes sense: it starts after the first 2 words (= window size) and then continues. The labels:

batch_labels = 1 1 1 1 1 1 251 1 1685 371 589 ...

I don't understand these labels very well. There are supposed to be 4 labels for each input, right (window size of 2, on each side)? But the batch_labels variable is the same length.

From the tensorflow tutorial:

The skip-gram model takes two inputs. One is a batch full of integers representing the source context words, the other is for the target words.
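
For reference, this is roughly how the tutorial wires those two inputs into the graph (a condensed sketch using the TF 1.x API; values such as batch_size and vocabulary_size are just the tutorial's defaults, not taken from my data):

import tensorflow as tf  # TF 1.x API, as used in the tutorial

batch_size = 128
embedding_size = 128
vocabulary_size = 50000
num_sampled = 64

# One input word id per row of the batch.
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
# One target word id per row, hence the extra dimension of size 1.
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / embedding_size ** 0.5))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# NCE loss: predict the label word from the embedded input word.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                   labels=train_labels, inputs=embed,
                   num_sampled=num_sampled, num_classes=vocabulary_size))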

As per the tutorial, I have declared the two variables as:

batch = np.ndarray(shape=(batch_size), dtype=np.int32)      # one input word id per row
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # one label word id per row

How should I interpret the batch_labels?

Answer

There are supposed to be 4 labels for each input, right (window size of 2, on each side)? But the batch_labels variable is the same length.

The key setting is num_skips = 1. This value defines the number of (input, label) tuples each word generates. See the examples with different num_skips below (my data sequence happens to be different from yours, sorry about that).

Example 1 - num_skips=4

batch, labels = generate_batch(batch_size=8, num_skips=4, skip_window=2)

It generates 4 labels for each word, i.e. it uses the whole context; since batch_size=8, only 2 words (12 and 6) are processed in this batch, and the rest will go into the next batch:

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [12 12 12 12  6  6  6  6]
labels = [[6 3084 5239 195 195 3084 12 2]]
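
One way to read this output is to pair up the corresponding entries of batch and labels (a small illustration, assuming batch and labels are the numpy arrays returned by generate_batch, with labels of shape (batch_size, 1)):

# Pair each input with its label to get the (center, context) training pairs.
pairs = list(zip(batch, labels[:, 0]))
# -> [(12, 6), (12, 3084), (12, 5239), (12, 195),
#     (6, 195), (6, 3084), (6, 12), (6, 2)]

Each center word (12, then 6) is paired with all 4 of its surrounding words.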

Example 2 - num_skips=2

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=2)

Here you would expect each word to appear twice in the batch sequence; its 2 labels are randomly sampled from the 4 possible context words:

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [ 12  12   6   6 195 195   2   2]
labels = [[ 195 3084   12  195 3137   12   46  195]]

Example 3 - num_skips=1

batch, labels = generate_batch(batch_size=8, num_skips=1, skip_window=2)

Finally, this setting (the same as yours) produces exactly one label per word; each label is drawn randomly from the 4-word context:

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [  12    6  195    2 3137   46   59  156]
labels = [[  6  12  12 195  59 156  46  46]]
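
For reference, here is a condensed sketch of the generate_batch logic that produces the batches above (adapted and simplified from the tutorial's generate_batch, not the exact tutorial code; the data list is just the example sequence used above):

import collections
import random
import numpy as np

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572]
data_index = 0  # global cursor into data, as in the tutorial

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window words | center | skip_window words ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        # Sample num_skips distinct context positions around the center word.
        context_positions = [w for w in range(span) if w != skip_window]
        for j, pos in enumerate(random.sample(context_positions, num_skips)):
            batch[i * num_skips + j] = buffer[skip_window]    # center word id
            labels[i * num_skips + j, 0] = buffer[pos]        # one sampled context word id
        buffer.append(data[data_index])                        # slide the window by one word
        data_index = (data_index + 1) % len(data)
    return batch, labels

num_skips is exactly the number of rows each center word contributes, which is why num_skips=1 gives one label per word.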

How should I interpret the batch_labels?

Each label is the center word to be predicted from the context. But the generated data may not include all (context, center) tuples, depending on the settings of the generator.

Also note that the train_labels tensor holds just one word per row (its shape is (batch_size, 1)). Skip-gram trains the model to predict any single context word from the given center word, not all 4 context words at once. This explains why all of the training pairs (12, 6), (12, 3084), (12, 5239), and (12, 195) are valid.
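
To make that pairing concrete, this is roughly how the two arrays are consumed during training (a minimal TF 1.x sketch, assuming the placeholder graph and the generate_batch sketch above; num_steps is an arbitrary illustrative value):

skip_window = 2    # same settings as in the question
num_skips = 1
num_steps = 10000  # illustrative only

optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for step in range(num_steps):
        batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        # Each row is one independent (center word, context word) pair;
        # the model never sees all 4 context words of a center word at once.
        feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)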
