How do I go from Pandas DataFrame to Tensorflow BatchDataset for NLP?


Question

I'm honestly trying to figure out how to convert a dataset (format: pandas DataFrame or numpy array) to a form that a simple text-classification tensorflow model can train on for sentiment analysis. The dataset I'm using is similar to IMDB (containing both text and labels (positive or negative)). Every tutorial I've looked at either prepares the data differently or skips data preparation entirely and leaves it to your imagination. (For instance, all the IMDB tutorials import a preprocessed Tensorflow BatchDataset from tensorflow_datasets, which isn't helpful when I'm using my own set of data.) My own attempts to convert a Pandas DataFrame to Tensorflow's Dataset types have resulted in ValueErrors or a negative loss during training. Any help would be appreciated.

I had originally prepared my data as follows, where training and validation are already shuffled Pandas DataFrames containing text and label columns:

# IMPORT STUFF

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf # (I'm using tensorflow 2.0)
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
import pandas as pd
import numpy as np
# ... [code for importing and preparing the pandas dataframe omitted]

# TOKENIZE

train_text = training['text'].to_numpy()
tok = Tokenizer(oov_token='<unk>')
tok.fit_on_texts(train_text)
tok.word_index['<pad>'] = 0   # reserve index 0 for the padding token
tok.index_word[0] = '<pad>'

train_seqs = tok.texts_to_sequences(train_text)
train_seqs = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

train_labels = training['label'].to_numpy().flatten()

valid_text = validation['text'].to_numpy()
valid_seqs = tok.texts_to_sequences(valid_text)
valid_seqs = tf.keras.preprocessing.sequence.pad_sequences(valid_seqs, padding='post')

valid_labels = validation['label'].to_numpy().flatten()

# CONVERT TO TF DATASETS

BUFFER_SIZE = 10000  # example values; these constants weren't defined in the original snippet
BATCH_SIZE = 64

train_ds = tf.data.Dataset.from_tensor_slices((train_seqs, train_labels))
valid_ds = tf.data.Dataset.from_tensor_slices((valid_seqs, valid_labels))

train_ds = train_ds.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
valid_ds = valid_ds.batch(BATCH_SIZE)

# PREFETCH

train_ds = train_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
valid_ds = valid_ds.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

This resulted in train_ds and valid_ds being tokenized and of type PrefetchDataset, or <PrefetchDataset shapes: ((None, None, None, 118), (None, None, None)), types: (tf.int32, tf.int64)>.

I then trained as follows, but got a large negative loss and an accuracy of 0.

vocab_size = len(tok.word_index) + 1  # not defined in the original snippet; a safe upper bound on token indices
embedding_dim = 16                    # example value

model = keras.Sequential([
    layers.Embedding(vocab_size, embedding_dim),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation='sigmoid') # also tried activation='softmax'
])

model.compile(optimizer='adam',
              loss='binary_crossentropy', # binary_crossentropy
              metrics=['accuracy'])

history = model.fit(
    train_ds,
    epochs=1,
    validation_data=valid_ds, validation_steps=1, steps_per_epoch=BUFFER_SIZE)

If I don't do the fancy prefetch stuff, train_ds would be of type BatchDataset, or <BatchDataset shapes: ((None, 118), (None,)), types: (tf.int32, tf.int64)>, but that also gets me a negative loss and an accuracy of 0.
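A quick way to sanity-check what the pipeline actually produces is to pull a single batch and print the shapes and raw label values (a minimal debugging sketch):

# Pull one batch to inspect shapes and the raw label values.
for batch_seqs, batch_labels in train_ds.take(1):
    print(batch_seqs.shape, batch_labels.shape)
    print(batch_labels.numpy()[:10])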

And if I just do the following:

x, y = training['text'].to_numpy(), training['label'].to_numpy()
x, y = tf.convert_to_tensor(x),tf.convert_to_tensor(y)

then x and y are of type EagerTensor, but I can't seem to figure out how to batch an EagerTensor.
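For reference, a Dataset built from those tensors can be batched the same way as before; a minimal sketch, reusing BATCH_SIZE from the earlier snippet:

# Tensors (or the numpy arrays directly) can be wrapped in a tf.data.Dataset and batched.
xy_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(BATCH_SIZE)
print(xy_ds)  # e.g. <BatchDataset shapes: ((None,), (None,)), types: (tf.string, ...)>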

What types and shapes do I really need for train_ds? What am I missing or doing wrong?

The text_classification_with_hub tutorial trains an already prepared IMDB dataset as shown:

model = tf.keras.Sequential()
model.add(hub_layer)  # hub_layer is the tensorflow_hub KerasLayer built earlier in that tutorial
model.add(tf.keras.layers.Dense(16, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=20,
                    validation_data=validation_data.batch(512),
                    verbose=1)

In this example, train_data is of type tensorflow.python.data.ops.dataset_ops._OptionsDataset, and train_data.shuffle(10000).batch(512) is tensorflow.python.data.ops.dataset_ops.BatchDataset (or <BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int64)>).

They apparently didn't bother with tokenization for this dataset, but I doubt tokenization is my issue. Why does their train_data.shuffle(10000).batch(512) work while my train_ds doesn't?

It's possible the issue is with the model setup, the Embedding layer, or with tokenization, but I'm not so sure that's the case. I've already looked at the following tutorials for inspiration:

https://www.tensorflow.org/tutorials/keras/text_classification_with_hub

https://www.kaggle.com/drscarlat/imdb-sentiment-analysis-keras-and-tensorflow

https://www.tensorflow.org/tutorials/text/image_captioning

https://www.tensorflow.org/tutorials/text/word_embeddings#learning_embeddings_from_scratch

https://thedatafrog.com/word-embedding-sentiment-analysis/

Answer

UPDATE: I figured out that the issue was that I neglected to convert my target labels to 0 and 1 for binary cross-entropy. The problem had nothing to do with converting to a Tensorflow Dataset type; my code above works fine for that.
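A minimal sketch of that fix, assuming the label column holds strings like 'positive' and 'negative' (the actual values aren't shown above; adjust the mapping accordingly):

# Hypothetical mapping: replace the keys with whatever the 'label' column actually contains.
label_map = {'negative': 0, 'positive': 1}
training['label'] = training['label'].map(label_map)
validation['label'] = validation['label'].map(label_map)

# Rebuild the label arrays used earlier, now as 0/1 integers.
train_labels = training['label'].to_numpy().astype('int64')
valid_labels = validation['label'].to_numpy().astype('int64')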
