Movie Review Classification with Recurrent Networks
Question
As far as I know from my research, the sequences in a dataset can have different lengths; we do not need to pad or truncate them, provided that each batch in the training process contains sequences of the same length.
To put this into practice, I set the batch size to 1 and trained my RNN model on the IMDB movie review classification dataset. The code I wrote is below.
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Embedding
max_features = 10000
batch_size = 1
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))
model.add(SimpleRNN(units=32, input_shape=(None, 32)))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop",
loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train,
batch_size=batch_size, epochs=10,
validation_split=0.2)
acc = history.history["acc"]
loss = history.history["loss"]
val_acc = history.history["val_acc"]
val_loss = history.history["val_loss"]
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.title("Training and Validation Accuracy")
plt.legend()
plt.figure()
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.title("Training and Validation Loss")
plt.legend()
plt.show()
The error I encounter is a failure to convert the input to a tensor, because the input NumPy array contains Python lists as its elements. When I change them, I keep getting similar kinds of errors.
Error message:
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).
I could not solve the problem. Could anyone help me with this?
Recommended answer
With Sequence Padding
There are two issues. First, you need to apply pad_sequences to the text sequences. Second, the input_shape argument on SimpleRNN is not needed: the layer is not the first one in the model, and its input shape is already determined by the Embedding layer before it. Try the following code:
max_features = 20000 # Only consider the top 20k words
maxlen = 200 # Only consider the first 200 words of each movie review
batch_size = 1
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=32))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, batch_size=batch_size,
epochs=10, validation_split=0.2)
The official code example might also help you.
Based on your comments and the information you shared, it does seem possible to use variable-length input sequences. Still, in most cases practitioners prefer to pad the sequences to a uniform length, because it is convenient. Choosing non-uniform or variable input sequence lengths is something of a special case, similar to wanting variable input image sizes for vision models.
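If you do want to avoid padding, the condition stated in the question (every batch holds sequences of a single length) can be met by grouping samples by length before batching. The following is a minimal pure-Python sketch of that idea. Note that bucket_by_length is a hypothetical helper written for illustration, not a Keras API, and each resulting batch would still have to be converted to an array before it is passed to the model.

```python
from collections import defaultdict


def bucket_by_length(sequences, labels, batch_size):
    """Group samples so that every batch holds sequences of one length.

    Hypothetical helper for illustration; not part of Keras.
    """
    buckets = defaultdict(list)
    for seq, label in zip(sequences, labels):
        buckets[len(seq)].append((seq, label))

    batches = []
    for length in sorted(buckets):
        samples = buckets[length]
        # Split each bucket into chunks of at most `batch_size` samples.
        for start in range(0, len(samples), batch_size):
            chunk = samples[start:start + batch_size]
            batch_x = [seq for seq, _ in chunk]
            batch_y = [label for _, label in chunk]
            batches.append((batch_x, batch_y))
    return batches


# Toy data: three reviews of length 3 and one of length 4.
toy_sequences = [[711, 632, 71], [73, 8, 3215], [83, 91, 1, 645], [5, 6, 7]]
toy_labels = [1, 0, 1, 0]

batches = bucket_by_length(toy_sequences, toy_labels, batch_size=2)
for batch_x, batch_y in batches:
    print(len(batch_x[0]), batch_x, batch_y)
```

The trade-off is that batches are no longer drawn uniformly at random from the whole dataset, which is one reason padding plus masking is usually the simpler choice.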
However, here we will add some information on padding and on how we can mask out the padded values at training time, which technically amounts to variable-length input training. Hopefully that convinces you. Let's first understand what pad_sequences does. In sequence data, it is very common for each training sample to have a different length. Consider the following inputs:
raw_inputs = [
[711, 632, 71],
[73, 8, 3215, 55, 927],
[83, 91, 1, 645, 1253, 927],
]
These 3 training samples have different lengths: 3, 5, and 6 respectively. What we do next is make them all the same length by adding some value (typically 0 or -1), either at the beginning or at the end of each sequence.
tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
array([[ 0, 0, 0, 711, 632, 71],
[ 0, 73, 8, 3215, 55, 927],
[ 83, 91, 1, 645, 1253, 927]], dtype=int32)
We can set padding="post" to add the pad values at the end of the sequence instead. In fact, "post" padding is the recommended choice when working with RNN layers, so that the CuDNN implementation of the layers can be used. Also note that we set maxlen=6, which is the longest input sequence length. It does not have to be the longest length, since that can get computationally expensive as the dataset grows. We can set it to 5, assuming that our model can learn a feature representation within this length; it is a kind of hyperparameter. That brings in another parameter: truncating.
tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=5, dtype="int32", padding="pre", truncating="pre", value=0.0
)
array([[ 0, 0, 711, 632, 71],
[ 73, 8, 3215, 55, 927],
[ 91, 1, 645, 1253, 927]], dtype=int32)
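To make the interaction between padding and truncating concrete, here is a small pure-Python sketch of the same logic. It is only an illustration: pad_sequences_sketch is a hypothetical helper, not the actual Keras implementation, and it returns plain lists rather than a NumPy array.

```python
def pad_sequences_sketch(sequences, maxlen, padding="pre", truncating="pre", value=0):
    """Pure-Python illustration of padding and truncation (hypothetical helper)."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate first: drop tokens from the front ("pre") or the back ("post").
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        # Then fill the remaining positions with the pad value.
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

pre_result = pad_sequences_sketch(raw_inputs, maxlen=5)
post_result = pad_sequences_sketch(raw_inputs, maxlen=5, padding="post", truncating="post")
print(pre_result)
print(post_result)
```

With the "pre" settings the longest sample loses its first token (83); with the "post" settings it loses its last token (927) instead.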
Okay, now we have padded input sequences, so all inputs have a uniform length. Next, we can mask out those additional padded values at training time. We tell the model that some part of the data is padding and should be ignored. That mechanism is masking: a way to tell sequence-processing layers that certain timesteps in the input are missing and should be skipped when processing the data. There are three ways to introduce input masks in Keras models:
- Add a keras.layers.Masking layer.
- Configure a keras.layers.Embedding layer with mask_zero=True.
- Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers).
Here we show only the approach of configuring the Embedding layer. It has a parameter called mask_zero, which is False by default. If we set it to True, timesteps whose index is 0 will be skipped. In the resulting mask, a False entry indicates that the corresponding timestep should be ignored during processing.
padd_input = tf.keras.preprocessing.sequence.pad_sequences(
raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
print(padd_input)
embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padd_input)
print(masked_output._keras_mask)
[[ 0 0 0 711 632 71]
[ 0 73 8 3215 55 927]
[ 83 91 1 645 1253 927]]
tf.Tensor(
[[False False False True True True]
[False True True True True True]
[ True True True True True True]], shape=(3, 6), dtype=bool)
And here is how the mask is computed in the Embedding(Layer) class:
def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
        return None
    return tf.not_equal(inputs, 0)
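To see why masking makes a padded sequence behave like its unpadded original, consider the toy sketch below. It is not Keras code: toy_rnn_final_state is a hypothetical one-unit recurrence with arbitrary weights, written only to illustrate the idea that a masked timestep is skipped and the previous state is carried forward.

```python
import math


def compute_mask(token_ids):
    """True for real tokens, False for padding (index 0)."""
    return [token != 0 for token in token_ids]


def toy_rnn_final_state(token_ids, mask=None):
    """Run a toy one-unit recurrence; masked timesteps leave the state unchanged.

    The weights (0.5, 0.001) are arbitrary values chosen for illustration.
    """
    if mask is None:
        mask = [True] * len(token_ids)
    state = 0.0
    for token, keep in zip(token_ids, mask):
        if not keep:
            continue  # padded timestep: carry the previous state forward
        state = math.tanh(0.5 * state + 0.001 * token)
    return state


unpadded = [711, 632, 71]
padded = [711, 632, 71, 0, 0, 0]  # "post" padding to length 6

mask = compute_mask(padded)
state_unpadded = toy_rnn_final_state(unpadded)
state_masked = toy_rnn_final_state(padded, mask)
state_unmasked = toy_rnn_final_state(padded)

print(mask)
print(state_unpadded, state_masked, state_unmasked)
```

The masked run ends in exactly the same state as the unpadded run, while the unmasked run keeps updating the state on the padding tokens and ends somewhere else. This is the sense in which padding plus masking amounts to variable-length training.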
There is one catch: if we set mask_zero to True, index 0 can no longer be used in the vocabulary. According to the documentation:
mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).
So, we have to use at least max_features + 1 as the input_dim.
Here is the complete example, based on your code.
import tensorflow as tf
from tensorflow.keras.datasets import imdb

# define the settings before they are used below
max_features = 20000 # Only consider the top 20k words
maxlen = 350 # Only keep the first 350 words of each review
batch_size = 256
# get the data
(x_train, y_train), (_, _) = imdb.load_data(num_words=max_features)
print(x_train.shape)
# check the highest sequence length
max_list_length = lambda seqs: max(len(s) for s in seqs)
print(max_list_length(x_train))
print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])
# (1). pad with value 0 at the end of each sequence - padding="post", value=0
# (2). keep the first `maxlen` words and cut the rest
# at the end - maxlen=maxlen, truncating="post"
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train,
    maxlen=maxlen, dtype="int32",
    padding="post", truncating="post",
    value=0)
print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])
Your model definition should now be:
model = Sequential()
model.add(Embedding(
input_dim=max_features + 1,
output_dim=32,
mask_zero=True))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train,
batch_size=256,
epochs=1, validation_split=0.2)
639ms/step - loss: 0.6774 - acc: 0.5640 - val_loss: 0.5034 - val_acc: 0.8036