TensorFlow Datasets with string inputs do not preserve data type


Question

All reproducible code below was run in Google Colab with TF 2.2.0-rc2.

Adapting the simple example from the documentation for creating a dataset from a simple Python list:

import numpy as np
import tensorflow as tf
tf.__version__
# '2.2.0-rc2'
np.version.version
# '1.18.2'

dataset1 = tf.data.Dataset.from_tensor_slices([1, 2, 3]) 
for element in dataset1: 
  print(element) 
  print(type(element.numpy()))

we get the result:

tf.Tensor(1, shape=(), dtype=int32)
<class 'numpy.int32'>
tf.Tensor(2, shape=(), dtype=int32)
<class 'numpy.int32'>
tf.Tensor(3, shape=(), dtype=int32)
<class 'numpy.int32'>

where all data types are int32, as expected.

But changing this simple example to feed a list of strings instead of integers:

dataset2 = tf.data.Dataset.from_tensor_slices(['1', '2', '3']) 
for element in dataset2: 
  print(element) 
  print(type(element.numpy()))

gives the result:

tf.Tensor(b'1', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'2', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'3', shape=(), dtype=string)
<class 'bytes'>

where, surprisingly, and despite the tensors themselves being of dtype=string, their evaluations are of type bytes.

This behavior is not confined to the .from_tensor_slices method; here is the situation with .list_files (the following snippet runs as-is in a fresh Colab notebook):

disc_data = tf.data.Dataset.list_files('sample_data/*.csv') # 4 csv files
for element in disc_data: 
  print(element) 
  print(type(element.numpy()))

the result being:

tf.Tensor(b'sample_data/california_housing_test.csv', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'sample_data/mnist_train_small.csv', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'sample_data/california_housing_train.csv', shape=(), dtype=string)
<class 'bytes'>
tf.Tensor(b'sample_data/mnist_test.csv', shape=(), dtype=string)
<class 'bytes'>

where again, the file names in the evaluated tensors are returned as bytes instead of string, even though the tensors themselves are of dtype=string.

Similar behavior is also observed with the .from_generator method (not shown here).

A final demonstration: as shown in the .as_numpy_iterator method documentation, the following equality condition evaluates to True:

dataset3 = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]), 
                                               'b': [5, 6]}) 

list(dataset3.as_numpy_iterator()) == [{'a': (1, 3), 'b': 5}, 
                                       {'a': (2, 4), 'b': 6}] 
# True

but if we change the elements of b to be strings, the equality condition now surprisingly evaluates to False!

dataset4 = tf.data.Dataset.from_tensor_slices({'a': ([1, 2], [3, 4]), 
                                               'b': ['5', '6']})   # change elements of b to strings

list(dataset4.as_numpy_iterator()) == [{'a': (1, 3), 'b': '5'},   # here
                                       {'a': (2, 4), 'b': '6'}]   # also
# False

probably due to the different data types, since the values themselves are evidently identical.

I didn't stumble upon this behavior through academic experimentation; I am trying to pass my data to TF Datasets using custom functions that read pairs of files from disk, of the form

f = ['filename1', 'filename2']

These custom functions work perfectly well on their own, but when mapped through TF Datasets they give

RuntimeError: not a string

which, after this digging, seems at least explicable if the returned data types are indeed bytes and not string.

So, is this a bug (as it seems), or am I missing something here?

Recommended answer

This is a known behavior:

From: https://github.com/tensorflow/tensorflow/issues/5552#issuecomment-260455136

TensorFlow converts str to bytes in most places, including sess.run, and this is unlikely to change. The user is free to convert back, but unfortunately it's too large a change to add a unicode dtype to the core. Closing as won't fix for now.
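As a rough illustration of "converting back", here is a minimal sketch in plain Python. The helper name `decode_bytes` is hypothetical, and `raw` is a hand-written stand-in for what `list(dataset4.as_numpy_iterator())` yields in the question (with `b` holding bytes), so the snippet does not need TensorFlow to run. UTF-8 is assumed as the encoding.

```python
def decode_bytes(obj, encoding="utf-8"):
    """Recursively turn bytes into str inside dicts, lists and tuples."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    if isinstance(obj, dict):
        return {key: decode_bytes(value, encoding) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Rebuild the same container type with decoded items.
        return type(obj)(decode_bytes(item, encoding) for item in obj)
    return obj  # numbers and other objects are left untouched


# Assumption: stand-in for list(dataset4.as_numpy_iterator()).
raw = [{'a': (1, 3), 'b': b'5'}, {'a': (2, 4), 'b': b'6'}]
expected = [{'a': (1, 3), 'b': '5'}, {'a': (2, 4), 'b': '6'}]

print(raw == expected)                # False: b'5' != '5' in Python 3
print(decode_bytes(raw) == expected)  # True once the bytes are decoded
```

In Python 3 a bytes object never compares equal to a str, which is why the original comparison fails even though the characters look identical.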

I guess nothing changed with TensorFlow 2.x: there are still places in which strings are converted to bytes, and you have to take care of this manually.

From the issue you opened yourself, it would seem that they treat the subject as a problem of NumPy, and not of TensorFlow itself.
