Reason for adding 1 to word index for sequence modeling


Problem description

I notice that in many tutorials, 1 is added to the word_index. For example, consider a sample code snippet inspired by TensorFlow's tutorial on NMT (https://www.tensorflow.org/tutorials/text/nmt_with_attention):

import tensorflow as tf

sample_input = ["sample sentence 1", "sample sentence 2"]

# Fit a tokenizer on the corpus; word_index maps each word to an integer index starting at 1
lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
lang_tokenizer.fit_on_texts(sample_input)

# Vocabulary size passed to the model: largest word index + 1
vocab_inp_size = len(lang_tokenizer.word_index) + 1

I don't understand the reason for adding 1 to the word_index dictionary. Won't adding an arbitrary 1 affect the prediction? Any suggestions would be helpful.

Answer

According to the documentation for layers.Embedding, the largest integer in the input should be smaller than the vocabulary size / input_dim:

input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.

That is why

vocab_inp_size = len(inp_lang.word_index)  + 1
vocab_tar_size = len(targ_lang.word_index) + 1
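The +1 is needed because Keras' Tokenizer assigns word indices starting from 1 (index 0 is left free, conventionally for padding), so the largest index equals len(word_index). A minimal sketch illustrating this with the sample corpus from the question (the printed mapping is shown as a comment):

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
tokenizer.fit_on_texts(["sample sentence 1", "sample sentence 2"])

# Indices start at 1, so the maximum index equals len(word_index),
# and input_dim must be len(word_index) + 1 to cover indices 0..max.
print(tokenizer.word_index)
# {'sample': 1, 'sentence': 2, '1': 3, '2': 4}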


For example, consider the following case:

import numpy as np
from tensorflow.keras.layers import Input, Embedding
from tensorflow.keras.models import Model

inp = np.array([
  [1, 0, 2, 0],
  [1, 1, 5, 0],
  [1, 1, 3, 0]
])
print(inp.shape, inp.max())  # (3, 4) 5

# The largest integer (i.e. word index) in the input must be
# strictly smaller than input_dim of the Embedding layer.

x = Input(shape=(4,))
e = Embedding(input_dim=inp.max() + 1, output_dim=5, mask_zero=False)(x)

m = Model(inputs=x, outputs=e)
print(m.predict(inp).shape)  # (3, 4, 5)
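Conversely, if input_dim were only inp.max() (i.e. 5), the valid indices would be 0..4 and the value 5 in inp would be out of range. A hedged sketch of the failure, not part of the original answer (exact behavior varies by TensorFlow version and device: on CPU an InvalidArgumentError is typically raised, while GPU lookups may silently return zeros):

import numpy as np
from tensorflow.keras.layers import Input, Embedding
from tensorflow.keras.models import Model

x = Input(shape=(4,))
e_bad = Embedding(input_dim=5, output_dim=5)(x)  # too small: valid indices are only 0..4
m_bad = Model(inputs=x, outputs=e_bad)

try:
    m_bad.predict(np.array([[1, 0, 2, 0], [1, 1, 5, 0], [1, 1, 3, 0]]))
except Exception as err:
    print(type(err).__name__)  # typically InvalidArgumentError on CPU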

The input_dim of the Embedding layer should be greater than inp.max(), otherwise an error will occur. Additionally, mask_zero defaults to False, but if it is set to True then, as a consequence, index 0 cannot be used in the vocabulary. According to the doc:

mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

So, if we set mask_zero to True in the above example, index 0 can no longer represent a real token; since inp uses 0 as actual data, every index effectively shifts up by one, and the input_dim of the Embedding layer would be

Embedding(input_dim=inp.max() + 2, output_dim=5, mask_zero=True)
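For the common case where the data comes from Tokenizer (word indices 1..N, with 0 used only for padding), input_dim = N + 1 already satisfies this. A short sketch, with the padded batch below being a hypothetical example rather than part of the original answer, showing how mask_zero=True turns the 0 entries into a mask for downstream layers:

import numpy as np
from tensorflow.keras.layers import Embedding

# Padded batch: 0 is the padding token, real word indices are 1..5
padded = np.array([[1, 2, 0, 0],
                   [3, 4, 5, 0]])

emb = Embedding(input_dim=5 + 1, output_dim=4, mask_zero=True)  # vocab of 5 words + 1
out = emb(padded)                 # shape (2, 4, 4)
print(emb.compute_mask(padded))
# [[ True  True False False]
#  [ True  True  True False]]    <- padding positions are masked out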
