Referencing and tokenizing single feature column in multi-feature TensorFlow Dataset


Question

I am attempting to tokenize a single column in a TensorFlow Dataset. The approach I've been using works well if there is only a single feature column, example:

from collections import Counter

import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds

text = ["I played it a while but it was alright. The steam was a bit of trouble."
        " The more they move these game to steam the more of a hard time I have"
        " activating and playing a game. But in spite of that it was fun, I "
        "liked it. Now I am looking forward to anno 2205 I really want to "
        "play my way to the moon.",
        "This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]

df = pd.DataFrame({"text": text,
                   "target": target})

training_dataset = (
    tf.data.Dataset.from_tensor_slices((
        tf.cast(df.text.values, tf.string), 
        tf.cast(df.target, tf.int32))))

tokenizer = tfds.features.text.Tokenizer()

lowercase = True
vocabulary = Counter()
for text, _ in training_dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)


vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))


encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)

However when I try to do this where there are a set of feature columns, say coming out of make_csv_dataset (where each feature column is named) the above methodology fails. (ValueError: Attempt to convert a value (OrderedDict([]) to a Tensor.).

I attempted to reference a specific feature column within the for loop using:

text = ["I played it a while but it was alright. The steam was a bit of trouble."
        " The more they move these game to steam the more of a hard time I have"
        " activating and playing a game. But in spite of that it was fun, I "
        "liked it. Now I am looking forward to anno 2205 I really want to "
        "play my way to the moon.",
        "This game is a bit hard to get the hang of, but when you do it's great."]
target = [0, 1]
gender = [1, 0]
age = [45, 35]



df = pd.DataFrame({"text": text,
                   "target": target,
                   "gender": gender,
                   "age": age})

df.to_csv('test.csv', index=False)

dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target')

tokenizer = tfds.features.text.Tokenizer()

lowercase = True
vocabulary = Counter()
for features, _ in dataset:
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)


vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))


encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=True,
                                              tokenizer=tokenizer)

I get the error: Expected binary or unicode string, got array([]). What is the proper way to reference a single feature column so that I can tokenize? Typically you can reference a feature column using the feature['column_name'] approach within a .map function, example:

def new_age_func(features, target):
    age = features['age']
    features['age'] = age/2
    return features, target

dataset = dataset.map(new_age_func)

for features, target in dataset.take(2):
    print('Features: {}, Target {}'.format(features, target))

I tried combining approaches and generating the vocabulary list via a map function.

tokenizer = tfds.features.text.Tokenizer()

lowercase = True
vocabulary = Counter()

def vocab_generator(features, target):
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
        tokens = tokenizer.tokenize(text.numpy())
        vocabulary.update(tokens)

dataset = dataset.map(vocab_generator)

But this results in the error:

AttributeError: in user code:

    <ipython-input-61-374e4c375b58>:10 vocab_generator  *
        tokens = tokenizer.tokenize(text.numpy())

    AttributeError: 'Tensor' object has no attribute 'numpy'

and changing tokenizer.tokenize(text.numpy()) to tokenizer.tokenize(text) throws another error TypeError: Expected binary or unicode string, got <tf.Tensor 'StringLower:0' shape=(2,) dtype=string>
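The underlying reason is that `Dataset.map` traces the mapped function in graph mode, so `text` is a symbolic `Tensor` with no `.numpy()` method. If one did want to keep the map-based approach, wrapping the eager work in `tf.py_function` is one way out. A sketch under that assumption (not the accepted answer's method), using a whitespace split as a stand-in for the tfds `Tokenizer`:

```python
from collections import Counter

import tensorflow as tf

vocabulary = Counter()

def update_vocab_py(text_batch):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    for t in text_batch.numpy():
        vocabulary.update(t.decode("utf-8").split())  # whitespace stand-in tokenizer
    return text_batch

def vocab_generator(features, target):
    # py_function is the escape hatch: its body executes eagerly even under map().
    tf.py_function(update_vocab_py, [tf.strings.lower(features["text"])], tf.string)
    return features, target

# Toy batched dataset shaped like make_csv_dataset output: (features dict, label).
dataset = tf.data.Dataset.from_tensor_slices(
    ({"text": ["a b a", "b c"]}, [0, 1])).batch(2)

for _ in dataset.map(vocab_generator):
    pass  # iterating the dataset triggers the side-effecting py_function
```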

Answer

The error is just that tokenizer.tokenize expects a single string, while make_csv_dataset yields batches, so features['text'] is an array of strings. This simple edit will work: loop over the batch and hand the tokenizer one string at a time instead of the whole array.

dataset = tf.data.experimental.make_csv_dataset(
    'test.csv',
    batch_size=2,
    label_name='target',
    num_epochs=1)

tokenizer = tfds.features.text.Tokenizer()

lowercase = True
vocabulary = Counter()
for features, _ in dataset:
    text = features['text']
    if lowercase:
        text = tf.strings.lower(text)
    for t in text:
        tokens = tokenizer.tokenize(t.numpy())
        vocabulary.update(tokens)
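To see concretely why the per-element loop matters, here is a plain-Python stand-in for the tokenizer's type check (illustrative only, not the tfds implementation): passing the whole batch raises the same kind of error as in the question, while feeding one string at a time works.

```python
def tokenize(s):
    # The real tfds Tokenizer makes a similar check: it accepts one string only.
    if not isinstance(s, (str, bytes)):
        raise TypeError("Expected binary or unicode string, got %r" % (s,))
    if isinstance(s, bytes):
        s = s.decode("utf-8")
    return s.lower().split()

batch = [b"This game is great", b"I played it a while"]

# tokenize(batch) would raise TypeError, mirroring the error in the question.
tokens = []
for t in batch:          # the fix: feed strings one at a time
    tokens.extend(tokenize(t))
```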

