How do I preprocess and tokenize a TensorFlow CsvDataset inside the map method?

Problem description

I made a TensorFlow CsvDataset, and I'm trying to tokenize the data as such:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
os.chdir('/home/nicolas/Documents/Datasets')

fname = 'rotten_tomatoes_reviews.csv'


def preprocess(target, inputs):
    tok = Tokenizer(num_words=5_000, lower=True)
    tok.fit_on_texts(inputs)
    vectors = tok.texts_to_sequences(inputs)
    return vectors, target


dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True).map(preprocess)

Running this gives the following error:

ValueError: len requires a non-scalar tensor, got one of shape Tensor("Shape:0", shape=(0,), dtype=int32)

What I've tried: just about anything in the realm of possibilities. Note that everything runs if I remove the preprocessing step.

Here's what the data looks like:

(<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=string, numpy=b" Some movie critic review...">)

Answer

First of all, let's identify the problems in your code:

  • The first problem, which is also the reason behind the given error, is that the fit_on_texts method accepts a list of texts, not a single text string. Therefore, it should be: tok.fit_on_texts([inputs]).

After fixing that and running the code again, you would get another error: AttributeError: 'Tensor' object has no attribute 'lower'. This is because the elements of the dataset are Tensor objects, which the function passed to map must be able to handle; the Tokenizer class, however, is not designed to work with Tensor objects (there is a fix for this problem, but I won't address it now because of the next problem).
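
To make that concrete, here is a small illustration (my own, not from the original answer) of what the mapped function actually receives:

import tensorflow as tf

def inspect(target, text):
    # Runs while `map` traces the function: `text` is a graph-mode tf.string
    # Tensor here, not a Python str, so it has no `.lower()` method.
    print(type(text))
    return target, text

ds = tf.data.Dataset.from_tensor_slices(([1], ["Some movie critic review..."]))
ds = ds.map(inspect)   # prints a Tensor type, not <class 'str'>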

The biggest problem is that each time the map function, i.e. preprocess, is called, a new instance of the Tokenizer class is created and fit on a single text document. Update: As @Princy correctly pointed out in the comments section, the fit_on_texts method actually performs a partial fit (i.e. it updates or augments the internal vocabulary stats, instead of starting from scratch). So if we create the Tokenizer instance outside the preprocess function, and assuming the vocabulary set is known beforehand (otherwise, you can't filter the most frequent words in a partial-fit scheme unless you have or first build the vocabulary set), then it would be possible to use this approach (i.e. the one based on the Tokenizer class) after applying the above fixes as well. However, personally, I prefer the solution below.
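
For completeness, here is a minimal sketch (my own, not part of the original answer) of what that Tokenizer-based alternative could look like, assuming dataset is the raw CsvDataset from the question (i.e. without the failing map call) and that the whole corpus fits in memory:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit a single Tokenizer once, over the whole corpus, before mapping.
tok = Tokenizer(num_words=5_000, lower=True)
all_texts = [text.numpy().decode("utf-8") for _, text in dataset]
tok.fit_on_texts(all_texts)

def tokenize_py(target, text):
    # `texts_to_sequences` expects a list of texts, hence the wrapping list.
    seq = tok.texts_to_sequences([text.numpy().decode("utf-8")])[0]
    return seq, target

def tokenize_map_fn(target, text):
    # `tf.py_function` lets the eager-only Tokenizer run inside `map`.
    seq, target = tf.py_function(tokenize_py, inp=[target, text],
                                 Tout=(tf.int32, tf.int32))
    seq.set_shape([None])
    target.set_shape([])
    return seq, target

tokenized_dataset = dataset.map(tokenize_map_fn)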

So, what should we do? As mentioned above, in almost all of the models which deal with text data, we first need to convert the texts into numerical features, i.e. encode them. For performing encoding, first we need a vocabulary set or a dictionary of tokens. Therefore, the steps we should take are as follows:

  1. If we have a pre-built vocabulary set, skip to the next step. Otherwise, tokenize all of the text data first and build the vocabulary.

  2. Encode the text data using the vocabulary set.

For performing the first step, we use tfds.features.text.Tokenizer to tokenize text data and build the vocabulary by iterating over the dataset.
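
As a quick illustration (mine; the printed output is what I'd expect from the tokenizer's default alphanumeric splitting), this is roughly what it produces for a single review:

import tensorflow_datasets as tfds

tokenizer = tfds.features.text.Tokenizer()
print(tokenizer.tokenize("A surprisingly good movie!"))
# Expected: ['A', 'surprisingly', 'good', 'movie']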

For the second step, we use tfds.features.text.TokenTextEncoder to encode the text data using the vocabulary set built in the previous step. Note that for this step we are using the map method; however, since map only works in graph mode, we have wrapped our encode function in tf.py_function so that it can be used with map.

Here is the code (please read the comments in the code for additional points; I have not included them in the answer because they are not directly relevant, but they are useful and practical):

import tensorflow as tf
import tensorflow_datasets as tfds
from collections import Counter

fname = "rotten_tomatoes_reviews.csv"
dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True)

# Create a tokenizer instance to tokenize text data.
tokenizer = tfds.features.text.Tokenizer()
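# Note: in newer tensorflow_datasets releases (4.x+), this class and
# TokenTextEncoder below live under `tfds.deprecated.text` instead.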

# Find unique tokens in the dataset.
lowercase = True  # set this to `False` if case-sensitivity is important.
vocabulary = Counter()
for _, text in dataset:
    if lowercase:
       text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

# Select the most common tokens as final vocabulary set.
# Note: if you want all the tokens to be included,
# set `vocab_size = len(vocabulary)` instead.
vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))

# Create an encoder instance given our vocabulary set.
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=lowercase,
                                              tokenizer=tokenizer)

# Set this to a non-zero integer if you want the texts
# to be truncated when they have more than `max_len` tokens.
max_len = None

def encode(target, text):
    text_encoded = encoder.encode(text.numpy())
    if max_len:
        text_encoded = text_encoded[:max_len]
    return text_encoded, target

# Wrap `encode` function inside `tf.py_function` so that
# it could be used with `map` method.
def encode_pyfn(target, text):
    text_encoded, target = tf.py_function(encode,
                                          inp=[target, text],
                                          Tout=(tf.int32, tf.int32))
    
    # (optional) Set the shapes for efficiency.
    text_encoded.set_shape([None])
    target.set_shape([])

    return text_encoded, target

# Apply encoding and then padding.
# Note: if you want the sequences in all the batches 
# to have the same length, set `padded_shapes` argument accordingly.
dataset = dataset.map(encode_pyfn).padded_batch(batch_size=3,
                                                padded_shapes=([None,], []))

# Important Note: probably this dataset would be used as input to a model
# which uses an Embedding layer. Therefore, don't forget that you
# should set the vocabulary size for this layer properly, i.e. the
# current value of `vocab_size` does not include the padding (added
# by `padded_batch` method) and also the OOV token (added by encoder).

Side note for future readers: notice that the order of arguments (i.e. target, text) and the data types are based on the OP's dataset. Adapt as needed for your own dataset/task (although at the end, i.e. return text_encoded, target, we adjusted the order to make it compatible with the expected format of the fit method).
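
As a follow-up to that note, here is a minimal, hypothetical sketch (the model architecture and hyperparameters are mine, not the answer's) of feeding the padded dataset into a Keras model with an Embedding layer; note the + 2 on the vocabulary size, accounting for the padding id and the OOV token mentioned in the Important Note above:

model = tf.keras.Sequential([
    # + 2: one slot for the padding id (0) and one for the OOV token.
    tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=64,
                              mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(dataset, epochs=3)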
