Reason for adding 1 to word index for sequence modeling
Question
I notice that in many tutorials 1 is added to the word_index. For example, consider a sample code snippet inspired by TensorFlow's tutorial for NMT (https://www.tensorflow.org/tutorials/text/nmt_with_attention):
import tensorflow as tf
sample_input = ["sample sentence 1", "sample sentence 2"]
lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
lang_tokenizer.fit_on_texts(sample_input)
vocab_inp_size = len(lang_tokenizer.word_index)+1
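To see where the +1 comes from, note that the Keras Tokenizer assigns word indices starting at 1 and reserves 0 (it is used for padding by pad_sequences), so the largest index equals len(word_index). A minimal sketch of this behavior (the exact index assignment shown in the comments assumes this two-sentence corpus):

```python
import tensorflow as tf

sample_input = ["sample sentence 1", "sample sentence 2"]
lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
lang_tokenizer.fit_on_texts(sample_input)

# Indices start at 1; 0 is reserved for padding, so the largest
# index equals the number of entries in word_index.
print(lang_tokenizer.word_index)                 # e.g. {'sample': 1, 'sentence': 2, '1': 3, '2': 4}
print(max(lang_tokenizer.word_index.values()))   # 4 == len(word_index)

# The Embedding layer needs input_dim strictly greater than the
# largest index, hence len(word_index) + 1.
vocab_inp_size = len(lang_tokenizer.word_index) + 1
print(vocab_inp_size)                            # 5
```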
I don't understand the reason for adding 1 to the word_index dictionary. Won't adding a seemingly arbitrary 1 affect the prediction? Any suggestions would be helpful.
Answer
According to the documentation for layers.Embedding, the largest integer in the input should be less than the vocabulary size / input_dim:

input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.
That is why:
vocab_inp_size = len(inp_lang.word_index) + 1
vocab_tar_size = len(targ_lang.word_index) + 1
For example, consider the following case:
import numpy as np
from tensorflow.keras.layers import Input, Embedding
from tensorflow.keras import Model

inp = np.array([
    [1, 0, 2, 0],
    [1, 1, 5, 0],
    [1, 1, 3, 0]
])
print(inp.shape, inp.max())  # (3, 4) 5

# The largest integer (i.e. word index) in the input
# should be no larger than the vocabulary size (input_dim) of the Embedding layer.
x = Input(shape=(4,))
e = Embedding(input_dim=inp.max() + 1, output_dim=5, mask_zero=False)(x)
m = Model(inputs=x, outputs=e)
m.predict(inp).shape  # (3, 4, 5)
Embedding
层的 input_dim
应该大于 inp.max()
,否则将发生错误.另外, mask_zero
是默认的 False
,但是如果设置为 True
,则索引 0
可以不会用在词汇上.根据 doc :
The input_dim
of the Embedding
layer should be greater than inp. max()
, the otherwise error will occur. Additionally, the mask_zero
is the default False
, but if it sets True
then as a consequence, index 0
can't be used in the vocabulary. According to the doc:
mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
So, if we set mask_zero to True in the above example, then the input_dim of the Embedding layer would be:
Embedding(input_dim=inp.max() + 2, output_dim=5, mask_zero=True)
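A quick runnable check of the mask_zero=True case (a sketch reusing the same inp array as above; the output shape is unchanged because masking only flags the padded positions, it does not drop them):

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding
from tensorflow.keras import Model

inp = np.array([
    [1, 0, 2, 0],
    [1, 1, 5, 0],
    [1, 1, 3, 0]
])

# With mask_zero=True, index 0 is treated as padding and masked out,
# so the usable vocabulary starts at 1 and input_dim grows by one more.
x = Input(shape=(4,))
e = Embedding(input_dim=inp.max() + 2, output_dim=5, mask_zero=True)(x)
m = Model(inputs=x, outputs=e)
print(m.predict(inp).shape)  # (3, 4, 5)
```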