Movie Review Classification with Recurrent Networks

Problem Description

As far as I know and have researched, the sequences in a dataset can have different lengths; we do not need to pad or truncate them, provided that each batch in the training process contains sequences of the same length.

To realize and apply this, I decided to set the batch size to 1 and trained my RNN model on the IMDB movie classification dataset. I have added the code I wrote below.

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Embedding

max_features = 10000
batch_size = 1

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))
model.add(SimpleRNN(units=32, input_shape=(None, 32)))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", 
                  loss="binary_crossentropy", metrics=["acc"])

history = model.fit(x_train, y_train, 
                     batch_size=batch_size, epochs=10, 
                     validation_split=0.2)

acc = history.history["acc"]
loss = history.history["loss"]
val_acc = history.history["val_acc"]
val_loss = history.history["val_loss"]

epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.title("Training and Validation Accuracy")
plt.legend()
plt.figure()
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.title("Training and Validation Loss")
plt.legend()
plt.show()

The error I encountered is a failure to convert the input to tensor format, because of the list components in the input NumPy array. However, when I change them, I continue to get similar kinds of errors.

Error message:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

I could not solve the problem. Could anyone help me with this?

Solution

With Sequence Padding

There are two issues. First, you need to use pad_sequences on the text sequences. Second, there is no such parameter as input_shape in SimpleRNN. Try the following code:

max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review
batch_size = 1

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)


model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=32))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, batch_size=batch_size, 
                         epochs=10, validation_split=0.2)

Here is the official code example; it might help you.

Based on your comments and information, it does seem possible to use variable-length input sequences; check this and this too. Still, in most cases practitioners prefer to pad the sequences to a uniform length, as it is more convenient. Choosing non-uniform or variable input sequence lengths is a special case, similar to wanting variable input image sizes for vision models.
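If you do want to train that way, here is a minimal sketch (not part of the original answer) of fitting on unpadded sequences with batch_size = 1, so every batch trivially contains sequences of a single length. It reuses the model and max_features defined above; x_train_raw / y_train_raw are just illustrative names for a fresh, unpadded reload of the data.

import numpy as np

# Reload the raw, unpadded sequences just for this demo.
(x_train_raw, y_train_raw), _ = imdb.load_data(num_words=max_features)

def single_sample_generator(x, y):
    # Yield one (sequence, label) pair per step; each sequence keeps
    # its own length, so no padding or truncation is needed.
    while True:
        for seq, label in zip(x, y):
            yield np.asarray(seq)[None, :], np.asarray([label])

model.fit(single_sample_generator(x_train_raw, y_train_raw),
          steps_per_epoch=len(x_train_raw), epochs=1)

The cost is speed: one sample per step gives up all batch parallelism, which is one more reason padding plus masking is usually preferred.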

However, here we will add some information on padding and on how we can mask out the padded values at training time, which technically amounts to variable-length input training. Hopefully that convinces you. Let's first understand what pad_sequences does. In sequence data it is very common for the training samples to have different lengths. Consider the following inputs:

raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

These 3 training samples have different lengths: 3, 5, and 6 respectively. What we do next is make them all equal length by adding some value (typically 0 or -1), either at the beginning or at the end of each sequence.

tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)

array([[   0,    0,    0,  711,  632,   71],
       [   0,   73,    8, 3215,   55,  927],
       [  83,   91,    1,  645, 1253,  927]], dtype=int32)

We can set padding="post" to place the pad values at the end of the sequence. In fact, "post" padding is recommended when working with RNN layers, in order to be able to use the CuDNN implementation of the layers. Also, FYI, you may notice we set maxlen=6, which is the highest input sequence length. But it does not have to be the highest input sequence length, as that can get computationally expensive when the dataset grows. Assuming our model can learn feature representations within a shorter length, we could set it to 5; it is a kind of hyperparameter. And that brings in another parameter, truncating.

tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=5, dtype="int32", padding="pre", truncating="pre", value=0.0
)

array([[   0,    0,  711,  632,   71],
       [  73,    8, 3215,   55,  927],
       [  91,    1,  645, 1253,  927]], dtype=int32)

Okay, now we have padded input sequences where all inputs have a uniform length. Now we can mask out the additional padded values at training time: we tell the model that some part of the data is padding and should be ignored. That mechanism is masking. It is a way to tell sequence-processing layers that certain timesteps in the input are missing and should therefore be skipped when processing the data. There are three ways to introduce input masks in Keras models:

  • Add a keras.layers.Masking layer.
  • Configure a keras.layers.Embedding layer with mask_zero=True.
  • Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers); see the sketch after this list.
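For completeness, the third option can be sketched like this (a small illustration, not from the original answer; demo_inputs is a made-up toy batch):

import tensorflow as tf

# A toy padded batch: 0 marks the padded positions.
demo_inputs = tf.keras.preprocessing.sequence.pad_sequences(
    [[711, 632, 71], [73, 8, 3215, 55, 927]], maxlen=6, padding="pre", value=0)

embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
mask = embedding.compute_mask(demo_inputs)  # boolean mask, shape (batch, timesteps)
x = embedding(demo_inputs)
output = tf.keras.layers.SimpleRNN(32)(x, mask=mask)  # padded steps are skipped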

Here we will demonstrate only the second way, configuring the Embedding layer. It has a parameter called mask_zero, which is False by default. If we set it to True, then positions containing 0 in the sequences will be skipped: a False entry in the resulting mask indicates that the corresponding timestep should be ignored during processing.

padded_input = tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
print(padded_input)

embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padded_input)
print(masked_output._keras_mask)

[[   0    0    0  711  632   71]
 [   0   73    8 3215   55  927]
 [  83   91    1  645 1253  927]]

tf.Tensor(
[[False False False  True  True  True]
 [False  True  True  True  True  True]
 [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)

Here is how the mask is computed in the source of the Embedding layer class:

  def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
      return None

    return tf.not_equal(inputs, 0)

And here is one catch: if we set mask_zero to True then, as a consequence, index 0 can no longer be used in the vocabulary. According to the doc:

mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

So we have to use at least max_features + 1. Here is a nice explanation of this.

Here is the complete example applying these fixes to your code.

max_features = 20000  # Only consider the top 20k words
maxlen = 350  # Only consider the first 350 words of each movie review
batch_size = 512

# get the data
(x_train, y_train), (_, _) = imdb.load_data(num_words=max_features)
print(x_train.shape)

# check the highest sequence length
max_seq_length = max(len(seq) for seq in x_train)
print(max_seq_length)

print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])

# (1). pad with value 0 at the end of each sequence - padding="post", value=0.
# (2). truncate each sequence to 'maxlen' words
#      (out of `max_seq_length`) at the end - maxlen=maxlen, truncating="post"
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, 
                                  maxlen=maxlen, dtype="int32", 
                                  padding="post", truncating="post", 
                                  value=0.)

print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])

Your model definition should now be:

model = Sequential()
model.add(Embedding(
           input_dim=max_features + 1,
           output_dim=32, 
           mask_zero=True))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, 
                    batch_size=256, 
                    epochs=1, validation_split=0.2)

639ms/step - loss: 0.6774 - acc: 0.5640 - val_loss: 0.5034 - val_acc: 0.8036
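As a quick sanity check (a sketch, not part of the original answer), you can evaluate on the held-out test split, padded and truncated exactly like the training data; it reuses model, max_features, and maxlen from above:

# Reload the test split (it was discarded above) and preprocess it identically.
(_, _), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_test = tf.keras.preprocessing.sequence.pad_sequences(
    x_test, maxlen=maxlen, dtype="int32",
    padding="post", truncating="post", value=0.)

print(model.evaluate(x_test, y_test, batch_size=256))  # [loss, accuracy]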

