Movie Review Classification with Recurrent Networks


Problem Description

As far as I know from my reading and research, the sequences in a dataset can have different lengths; we do not need to pad or truncate them, provided that each batch in the training process contains sequences of the same length.

To put this into practice, I decided to set the batch size to 1 and trained my RNN model on the IMDB movie classification dataset. The code I wrote is below.

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import SimpleRNN
from tensorflow.keras.layers import Embedding

max_features = 10000
batch_size = 1

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))
model.add(SimpleRNN(units=32, input_shape=(None, 32)))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="rmsprop", 
                  loss="binary_crossentropy", metrics=["acc"])

history = model.fit(x_train, y_train, 
                     batch_size=batch_size, epochs=10, 
                     validation_split=0.2)

acc = history.history["acc"]
loss = history.history["loss"]
val_acc = history.history["val_acc"]
val_loss = history.history["val_loss"]

epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, "bo", label="Training Acc")
plt.plot(epochs, val_acc, "b", label="Validation Acc")
plt.title("Training and Validation Accuracy")
plt.legend()
plt.figure()
plt.plot(epochs, loss, "bo", label="Training Loss")
plt.plot(epochs, val_loss, "b", label="Validation Loss")
plt.title("Training and Validation Loss")
plt.legend()
plt.show()

The error I encountered was a failure to convert the input to tensor format because of the list components in the input NumPy array. However, when I change them, I continue to get similar kinds of errors.

Error message:

ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type list).

I could not solve this problem. Could anyone help me on this point?

Answer

With Sequence Padding

There are two issues. You need to use pad_sequences on the text sequences first. Also, there is no such param as input_shape in SimpleRNN. Try the following code:

max_features = 20000  # Only consider the top 20k words
maxlen = 200  # Only consider the first 200 words of each movie review
batch_size = 1

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), "Training sequences")
print(len(x_test), "Validation sequences")
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)


model = Sequential()
model.add(Embedding(input_dim=max_features, output_dim=32))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train, batch_size=batch_size, 
                         epochs=10, validation_split=0.2)

Here is the official code example; it might help you.

Based on your comments and information, it seems that it is possible to use a variable-length input sequence; check this and this too. Still, in most cases practitioners prefer to pad the sequences to a uniform length, as it is the conventional approach. Choosing a non-uniform or variable input sequence length is something of a special case, similar to wanting variable input image sizes for vision models. A rough sketch of this special case follows.
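As an added illustration (not part of the original answer): with batch_size=1, every batch trivially contains sequences of a single length, so each review can be fed as its own single-sample batch via train_on_batch, with no padding at all. Everything below uses the standard Keras/NumPy APIs; the subset size is arbitrary, just to keep the sketch fast.

import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb

max_features = 10000
(x_train, y_train), _ = imdb.load_data(num_words=max_features)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=max_features, output_dim=32),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

# Each review becomes its own batch of shape (1, seq_len), so no padding
# is needed; only a small subset is used here for speed.
for seq, label in zip(x_train[:1000], y_train[:1000]):
    model.train_on_batch(np.array([seq]), np.array([label]))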

However, here we will add info on padding and on how we can mask out the padded values at training time, which is technically still variable-length input training. Hopefully that convinces you. Let's first understand what pad_sequences does. In sequence data it is very common for each training sample to have a different length. Consider the following inputs:

raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

These 3 training samples have different lengths: 3, 5, and 6 respectively. What we do next is make them all equal in length by adding some value (typically 0 or -1), either at the beginning or at the end of each sequence.

tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)

array([[   0,    0,    0,  711,  632,   71],
       [   0,   73,    8, 3215,   55,  927],
       [  83,   91,    1,  645, 1253,  927]], dtype=int32)

We can set padding="post" to place the pad values at the end of the sequence. The documentation recommends "post" padding when working with RNN layers, in order to be able to use the CuDNN implementation of the layers. Also, FYI, you may notice we set maxlen=6, which is the longest input sequence length. It does not have to be the longest, though, as that can get computationally expensive when the dataset grows. We can set it to 5, assuming our model can learn a feature representation within that length; it is a kind of hyperparameter. And that brings in another parameter, truncating.

tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=5, dtype="int32", padding="pre", truncating="pre", value=0.0
)

array([[   0,    0,  711,  632,   71],
       [  73,    8, 3215,   55,  927],
       [  91,    1,  645, 1253,  927]], dtype=int32)

Okay, now we have padded input sequences, and all inputs have a uniform length. We can now mask out those additional padded values at training time. We tell the model that some part of the data is padding and should be ignored. That mechanism is masking. It is a way to tell sequence-processing layers that certain timesteps in the input are missing and should therefore be skipped when processing the data. There are three ways to introduce input masks in Keras models:

  • Add a keras.layers.Masking layer (a minimal sketch of this option follows the list).
  • Configure a keras.layers.Embedding layer with mask_zero=True.
  • Pass a mask argument manually when calling layers that support this argument (e.g. RNN layers).
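As a small added sketch (not from the original answer) of the first option: an explicit Masking layer placed in front of an RNN. It assumes already-embedded float inputs with 16 features per timestep, where padded timesteps are all zeros.

import tensorflow as tf

# Masking skips every timestep whose features all equal mask_value (0.0 here);
# the input shape (None, 16) means variable-length sequences of 16-dim vectors.
model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, 16)),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])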

Below, we demonstrate the approach of configuring the Embedding layer. It has a parameter called mask_zero, which is set to False by default. If we set it to True, then indices containing 0 in the sequences will be skipped. In the resulting mask, a False entry indicates that the corresponding timestep should be ignored during processing.

padd_input = tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, maxlen=6, dtype="int32", padding="pre", value=0.0
)
print(padd_input)

embedding = tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padd_input)
print(masked_output._keras_mask)

[[   0    0    0  711  632   71]
 [   0   73    8 3215   55  927]
 [  83   91    1  645 1253  927]]

tf.Tensor(
[[False False False  True  True  True]
 [False  True  True  True  True  True]
 [ True  True  True  True  True  True]], shape=(3, 6), dtype=bool)

And here is how the mask is computed in the Embedding(Layer) class:

  def compute_mask(self, inputs, mask=None):
    if not self.mask_zero:
      return None

    return tf.not_equal(inputs, 0)
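As a quick added sanity check, reusing padd_input from above, the same mask can be reproduced directly from the padded integer inputs:

# Reproduces the _keras_mask tensor printed above.
print(tf.not_equal(padd_input, 0))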

And here is one catch: if we set mask_zero to True, then, as a consequence, index 0 cannot be used in the vocabulary. According to the doc:

mask_zero: Boolean, whether or not the input value 0 is a special "padding" value that should be masked out. This is useful when using recurrent layers which may take variable length input. If this is True, then all subsequent layers in the model need to support masking or an exception will be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the vocabulary (input_dim should equal size of vocabulary + 1).

So, we have to use at least max_features + 1 as the input_dim. Here is a nice explanation of this.

Here is the complete example, applying all of this to your code.

max_features = 20000  # Only consider the top 20k words
maxlen = 350  # Only consider the first 350 words out of `max_list_length(x_train)`
batch_size = 256

# get the data
(x_train, y_train), (_, _) = imdb.load_data(num_words=max_features)
print(x_train.shape)

# check the highest sequence length
max_list_length = lambda seqs: max(len(s) for s in seqs)
print(max_list_length(x_train))

print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])

# (1). padding with value 0 at the end of the sequence - padding="post", value=0.
# (2). truncate to 'maxlen' words
# out of `max_list_length(x_train)` at the end - maxlen=maxlen, truncating="post"
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, 
                                  maxlen=maxlen, dtype="int32", 
                                  padding="post", truncating="post", 
                                  value=0.)

print('Length ', len(x_train[0]), x_train[0])
print('Length ', len(x_train[1]), x_train[1])
print('Length ', len(x_train[2]), x_train[2])

Your model definition should now be:

model = Sequential()
model.add(Embedding(
           input_dim=max_features + 1,
           output_dim=32, 
           mask_zero=True))
model.add(SimpleRNN(units=32))
model.add(Dense(1, activation="sigmoid"))

model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=1, validation_split=0.2)

639ms/step - loss: 0.6774 - acc: 0.5640 - val_loss: 0.5034 - val_acc: 0.8036
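As a possible follow-up (an addition; it assumes you load the test split instead of discarding it as (_, _) above), the held-out reviews can be padded the same way and then evaluated:

# Hypothetical follow-up: pad the test split identically, then evaluate.
x_test = tf.keras.preprocessing.sequence.pad_sequences(
    x_test, maxlen=maxlen, dtype="int32",
    padding="post", truncating="post", value=0.)
loss, acc = model.evaluate(x_test, y_test, batch_size=batch_size)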

