IndexError: List Index out of range Keras Tokenizer


Problem Description

I'm working with the sentiment140 dataset to try and learn sentiment analysis using RNNs. I found a tutorial online that uses the keras.imdb data source, but I want to try and use my own data source, so I have tried to adapt the code to my own data. Tutorial: https://towardsdatascience.com/a-beginners-guide-on-sentiment-analysis-with-rnn-9e100627c02e

The data preprocessing involves extracting the series data and then tokenizing and padding it before sending it to the model for training. I perform these operations in my code below, but whenever I try to run the training I get "if isinstance(data[0], list): IndexError: list index out of range". I did not define data, so this leads me to believe that I did something keras or tensorflow did not like. Any ideas as to what is causing this error?

My data is currently in a csv file format with the headers being SENTIMENT and TEXT. SENTIMENT is 0 for negative and 1 for positive. TEXT is the processed tweet that was collected. Here is a sample.

Dataset CSV (only a few lines, to save space)

SENTIMENT,TEXT
0,about to file tax
0,ahh i hate dogs
1,My paycheck came in today
1,lot to do before chi this weekend
1,lol love food

Code

import pandas as pd
import keras
import keras.preprocessing.text as kpt
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import json
import numpy as np


# Load in DS
df = pd.read_csv('./train.csv')
print(df.head())

#Create sequence
vocabulary_size = 1000
tokenizer = Tokenizer(num_words= vocabulary_size, split=' ')
tokenizer.fit_on_texts(df['TEXT'].values)
X_train = tokenizer.texts_to_sequences(df['TEXT'].values)

#Pad Sequence
X_train = pad_sequences(X_train)
print(X_train)

#Get Sentiment
y_train = df['SENTIMENT'].tolist()


#create model
max_words = 24
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
embedding_size=32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

model.compile(loss='binary_crossentropy', 
             optimizer='adam', 
             metrics=['accuracy'])

batch_size = 64
num_epochs = 3
X_valid, y_valid = X_train[:batch_size], y_train[:batch_size]
X_train2, y_train2 = X_train[batch_size:], y_train[batch_size:]
model.fit(X_train2, y_train2,
    validation_data=(X_valid, y_valid),
    batch_size=batch_size,
    epochs=num_epochs)

Output

Using TensorFlow backend.
   SENTIMENT                                               TEXT
0          0  aww that be bummer You shoulda get david carr ...
1          0  be upset that he can not update his facebook b...
2          0  I dive many time for the ball manage to save t...
3          0      my whole body feel itchy and like its on fire
4          0  no it be not behave at all be mad why be here ...
[[  0   0   0 ...   3  10   5]
 [  0   0   0 ...  46  47  89]
 [  0   0   0 ...  29   9  96]
 ...
 [  0   0   0 ...  30 309 310]
 [  0   0   0 ...   0   0  72]
 [  0   0   0 ...  33 312 313]]
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 24, 32)            32000
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101
=================================================================
Total params: 85,301
Trainable params: 85,301
Non-trainable params: 0
_________________________________________________________________
None
Traceback (most recent call last):
  File "mcve.py", line 50, in <module>
    epochs=num_epochs)
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py", line 950, in fit
    batch_size=batch_size)
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py", line 787, in _standardize_user_data
    exception_prefix='target')
  File "/home/dv/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training_utils.py", line 79, in standardize_input_data
    if isinstance(data[0], list):
IndexError: list index out of range

JUPYTER NOTEBOOK ERROR

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-25-184505b70981> in <module>()
     20 model.fit(X_train2, y_train2,
     21     batch_size=batch_size,
---> 22     epochs=num_epochs)
     23 

~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
    948             sample_weight=sample_weight,
    949             class_weight=class_weight,
--> 950             batch_size=batch_size)
    951         # Prepare validation data.
    952         do_validation = False

~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)
    785                 feed_output_shapes,
    786                 check_batch_axis=False,  # Don't enforce the batch size.
--> 787                 exception_prefix='target')
    788 
    789             # Generate sample-wise weight values given the `sample_weight` and

~/tensorflow/venv/lib/python3.6/site-packages/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
     77                              'for each key in: ' + str(names))
     78     elif isinstance(data, list):
---> 79         if isinstance(data[0], list):
     80             data = [np.asarray(d) for d in data]
     81         elif len(names) == 1 and isinstance(data[0], (float, int)):

IndexError: list index out of range

Answer

Edit
My former suggestion was wrong. I've checked your code and run it, and it works without errors for me. Then I looked at the source code of the standardize_input_data function. There is a line which checks the data argument:

def standardize_input_data(data,
                           names,
                           shapes=None,
                           check_batch_axis=True,
                           exception_prefix=''):
    """Normalizes inputs and targets provided by users.
    Users may pass data as a list of arrays, dictionary of arrays,
    or as a single array. We normalize this to an ordered list of
    arrays (same order as `names`), while checking that the provided
    arrays have shapes that match the network's expectations.
    # Arguments
        data: User-provided input data (polymorphic).
        ...

Line 79:

 elif isinstance(data, list):
        if isinstance(data[0], list):
            ...

So it looks like, in the error case, the input data is a list, but a list of zero length.
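For illustration, here is a minimal sketch of how that line fails on an empty list (the data value here is hypothetical, just mirroring the Keras check):

data = []                          # an empty list, like what Keras apparently receives
if isinstance(data, list):
    if isinstance(data[0], list):  # raises IndexError: list index out of range
        pass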

The standardize_input_data function is called inside the Model.fit(...) method through a call to Model._standardize_user_data(...). Through this chain of functions, the passed data argument gets the value of the x argument of Model.fit(x, y, ...). So my guess is that the problem is with the type or content of X_train2 or X_valid. Could you provide the contents of X_train2 and X_valid in addition to X_train?
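As a quick diagnostic, something like the sketch below (using the variables from the question) can be used to inspect what actually reaches fit(). Note that the traceback shows exception_prefix='target', which suggests it is the target data (y) being standardized when the check fails; since y_train is built with .tolist(), converting the targets to NumPy arrays is a plausible workaround (an assumption on my part, not confirmed above):

import numpy as np

# Print the type and shape/length of everything passed to fit()
for name, arr in [('X_train2', X_train2), ('y_train2', y_train2),
                  ('X_valid', X_valid), ('y_valid', y_valid)]:
    print(name, type(arr), getattr(arr, 'shape', len(arr)))

# Keras expects NumPy arrays; plain Python lists of labels can trip the check above
y_train2 = np.asarray(y_train2)
y_valid = np.asarray(y_valid)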

Old wrong suggestion
You should increase the vocabulary size by one to deal with out-of-vocabulary tokens, I guess.
I.e., change the initialization of the Embedding layer:

model.add(Embedding(vocabulary_size + 1, embedding_size, input_length=max_words))

According to the docs, "input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1".
You may check the maximum index value with max(X_train) (edited).
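For example, a small sketch (using the variables from the question) to check the maximum token index against the embedding's input_dim:

import numpy as np

# X_train is the padded 2-D array of token indices from pad_sequences
max_index = np.max(X_train)
print('max token index:', max_index, '| embedding input_dim:', vocabulary_size)
assert max_index < vocabulary_size, 'every index must be < input_dim'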

Hope it helps!
