Understanding input and labels in word2vec (TensorFlow)
Problem description
I am trying to properly understand the batch_input and batch_labels from the TensorFlow "Vector Representations of Words" tutorial.
For example, my data
1 1 1 1 1 1 1 1 5 251 371 371 1685 ...
...starts with
skip_window = 2 # How many words to consider left and right.
num_skips = 1 # How many times to reuse an input to generate a label.
Then the generated input array is:
batch_input = 1 1 1 1 1 1 5 251 371 ....
This makes sense: it starts after the first 2 words (= window size) and then continues. The labels:
batch_labels = 1 1 1 1 1 1 251 1 1685 371 589 ...
I don't understand these labels very well. There are supposed to be 4 labels for each input, right (window size 2, on each side)? But the batch_labels variable is the same length.
From the TensorFlow tutorial:
The skip-gram model takes two inputs. One is a batch full of integers representing the source context words, the other is for the target words.
As per the tutorial, I have declared the two variables as:
batch = np.ndarray(shape=(batch_size), dtype=np.int32)
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
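For reference, a quick check of the shapes these declarations produce (a minimal sketch, not part of the tutorial itself):

```python
import numpy as np

batch_size = 8

# One input word id per training example.
batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
# One label per example, stored as a column (shape (batch_size, 1)),
# which is the shape the tutorial's loss function expects.
labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)

print(batch.shape)   # (8,)
print(labels.shape)  # (8, 1)
```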
How should I interpret batch_labels?
Answer
There are supposed to be 4 labels for each input, right (window size 2, on each side)? But the batch_labels variable is the same length.
The key setting is num_skips = 1. This value defines the number of (input, label) tuples each word generates. See the examples with different num_skips below (my data sequence seems to be different from yours, sorry about that).
Example 1 - num_skips=4
batch, labels = generate_batch(batch_size=8, num_skips=4, skip_window=2)
It generates 4 labels for each word, i.e. uses the whole context; since batch_size=8, only 2 words are processed in this batch (12 and 6); the rest will go into the next batch:
data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [12 12 12 12 6 6 6 6]
labels = [[6 3084 5239 195 195 3084 12 2]]
Example 2 - num_skips=2
batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=2)
Here you would expect each word to appear twice in the batch sequence; the 2 labels are randomly sampled from the 4 possible context words:
data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [ 12 12 6 6 195 195 2 2]
labels = [[ 195 3084 12 195 3137 12 46 195]]
Example 3 - num_skips=1
batch, labels = generate_batch(batch_size=8, num_skips=1, skip_window=2)
Finally, this setting, the same as yours, produces exactly one label per word; each label is drawn randomly from the 4-word context:
data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, ...]
batch = [ 12 6 195 2 3137 46 59 156]
labels = [[ 6 12 12 195 59 156 46 46]]
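The sampling behavior in these three examples can be sketched with a simplified, stand-alone version of generate_batch. This is not the tutorial's exact code (the original keeps a module-level data_index and wraps around the end of the data); it only illustrates how num_skips labels are drawn from each skip_window-sized context:

```python
import collections
import random

def generate_batch(data, batch_size, num_skips, skip_window):
    """Simplified sketch of the tutorial's generate_batch:
    no module-level data_index, no wraparound at the end of data."""
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    span = 2 * skip_window + 1  # [ skip_window ... center ... skip_window ]
    buffer = collections.deque(data[:span], maxlen=span)
    index = span                # next word to slide into the window
    batch, labels = [], []
    while len(batch) < batch_size:
        # Sample num_skips context positions (everything except the center).
        context = [i for i in range(span) if i != skip_window]
        for pos in random.sample(context, num_skips):
            batch.append(buffer[skip_window])   # input: center word
            labels.append(buffer[pos])          # label: one context word
        if index < len(data):                   # slide the window right
            buffer.append(data[index])
            index += 1
    return batch, labels

data = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742]
batch, labels = generate_batch(data, batch_size=8, num_skips=4, skip_window=2)
print(batch)   # [12, 12, 12, 12, 6, 6, 6, 6]
```

With num_skips=4 the labels are the whole 4-word context of each center word (in random order); with num_skips=1 each center word contributes exactly one randomly chosen context word, matching the batches shown above.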
How should I interpret the batch_labels?
Each label is the center word to be predicted from the context. But the generated data may not include all (context, center) tuples, depending on the settings of the generator.
Also note that the train_labels tensor is 1-dimensional. Skip-Gram trains the model to predict any context word from the given center word, not all 4 context words at once. This explains why all training pairs (12, 6), (12, 3084), (12, 5239) and (12, 195) are valid.
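To make that last point concrete, the four valid single-label pairs for the center word 12 can be enumerated directly (a small illustration, not tutorial code):

```python
# With skip_window = 2, the center word 12 in data = [5239, 3084, 12, 6, 195]
# has the context {5239, 3084, 6, 195}, so each of the four
# (input, label) pairs is a valid one-label training example.
data = [5239, 3084, 12, 6, 195]
center_idx, skip_window = 2, 2
pairs = [(data[center_idx], data[i])
         for i in range(center_idx - skip_window, center_idx + skip_window + 1)
         if i != center_idx]
print(pairs)   # [(12, 5239), (12, 3084), (12, 6), (12, 195)]
```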