TensorFlow 2.0: Packing numerical features of a dataset together in a functional way


Question

I am trying to reproduce the TensorFlow tutorial code from here, which downloads a CSV file and preprocesses the data (up to the point of combining the numerical features together).

A reproducible example follows:

import pandas as pd  # needed by get_df_batch below
import tensorflow as tf
print("TF version is: {}".format(tf.__version__))

# Download data
train_url = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
test_url  = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_path = tf.keras.utils.get_file("train.csv", train_url)
test_path  = tf.keras.utils.get_file("test.csv",  test_url)


# Get data into batched dataset
def get_dataset(path):
    dataset = tf.data.experimental.make_csv_dataset(path
                                                   ,batch_size=5
                                                   ,num_epochs=1
                                                   ,label_name='survived'
                                                   ,na_value='?'
                                                   ,ignore_errors=True)
    return dataset

raw_train_dataset = get_dataset(train_path)
raw_test_dataset  = get_dataset(test_path)

# Define numerical and categorical column lists
def get_df_batch(dataset):
    for batch,label in dataset.take(1):
        df = pd.DataFrame()
        df['survived'] = label.numpy()
        for key, value in batch.items():
            df[key] = value.numpy()
        return df

dfb = get_df_batch(raw_train_dataset)
num_columns = [i for i in dfb if (dfb[i].dtype != 'O' and i!='survived')]
cat_columns = [i for i in dfb if dfb[i].dtype == 'O']


# Combine numerical columns into one `numerics` column
class Pack():
    def __init__(self,names):
        self.names = names
    def __call__(self,features, labels):
        num_features = [features.pop(name) for name in self.names]
        num_features = [tf.cast(feat, tf.float32) for feat in num_features]
        num_features = tf.stack(num_features, axis=1)
        features["numerics"] = num_features
        return features, labels

packed_train = raw_train_dataset.map(Pack(num_columns))


# Show what we got
def show_batch(dataset):
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print("{:20s}: {}".format(key,value.numpy()))

show_batch(packed_train)

TF version is: 2.0.0
sex                 : [b'female' b'female' b'male' b'male' b'male']
class               : [b'Third' b'First' b'Second' b'First' b'Third']
deck                : [b'unknown' b'E' b'unknown' b'C' b'unknown']
embark_town         : [b'Queenstown' b'Cherbourg' b'Southampton' b'Cherbourg' b'Queenstown']
alone               : [b'n' b'n' b'y' b'n' b'n']
numerics            : [[ 28.       1.       0.      15.5   ]
 [ 40.       1.       1.     134.5   ]
 [ 32.       0.       0.      10.5   ]
 [ 49.       1.       0.      89.1042]
 [  2.       4.       1.      29.125 ]]

Then I try, and fail, to combine the numeric features in a functional way:

@tf.function
def pack_func(row, num_columns=num_columns):
    features, labels = row
    num_features = [features.pop(name) for name in num_columns]
    num_features = [tf.cast(feat, tf.float32) for feat in num_features]
    num_features = tf.stack(num_features, axis=1)
    features['numerics'] = num_features
    return features, labels

packed_train = raw_train_dataset.map(pack_func)

Partial traceback:

ValueError: in converted code:
    :3 pack_func  *
        features, labels = row
    ValueError: too many values to unpack (expected 2)

There are 2 questions here:

  1. How do features and labels get assigned in def __call__(self, features, labels): in the definition of class Pack? My intuition is that they should be passed in as defined variables, though I do not understand where they get defined.

  2. When I do

for row in raw_train_dataset.take(1):
    print(type(row))
    print(len(row))
    f,l = row
    print(f)
    print(l)

I see that row in raw_train_dataset is a 2-tuple, which can be successfully unpacked into features and labels. Why can't the same be done via the map API? Can you suggest the right way of combining numerical features in a functional way?

Thank you very much!

Answer

After some research and trial, the answer to the second question seems to be:

def pack_func(features, labels, num_columns=num_columns):
    num_features = [features.pop(name) for name in num_columns]
    num_features = [tf.cast(feat, tf.float32) for feat in num_features]
    num_features = tf.stack(num_features, axis=1)
    features['numerics'] = num_features
    return features, labels

packed_train = raw_train_dataset.map(pack_func)

show_batch(packed_train)

sex                 : [b'male' b'male' b'male' b'female' b'male']
class               : [b'Third' b'Third' b'Third' b'First' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'E' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Queenstown' b'Cherbourg' b'Queenstown']
alone               : [b'y' b'n' b'n' b'n' b'y']
numerics            : [[24.      0.      0.      8.05  ]
 [14.      5.      2.     46.9   ]
 [ 2.      4.      1.     29.125 ]
 [39.      1.      1.     83.1583]
 [21.      0.      0.      7.7333]]
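
As for the first question, and why the failing version raised its error: as far as I understand it, Dataset.map unpacks a tuple element into positional arguments, so the mapped function is called as f(features, labels) rather than f((features, labels)). That is where features and labels come from in Pack.__call__. In the failing pack_func this would mean that row received the features dict and num_columns received the labels, so features, labels = row tried to unpack the dict's keys, hence "too many values to unpack". Below is a minimal sketch of the same idea on a toy in-memory dataset. The column values and the make_packer helper are made up for illustration and are not part of the tutorial; the closure simply replaces the class or the default argument as a way to carry the column names.

```python
import tensorflow as tf

# Toy stand-in for the CSV dataset: each element is a (features_dict, label) tuple.
# The values below are invented for illustration only.
toy_features = {
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.1],
    "sex": ["male", "female", "female", "female"],
}
toy_labels = [0, 1, 1, 1]
toy_ds = tf.data.Dataset.from_tensor_slices((toy_features, toy_labels)).batch(2)


def make_packer(names):
    """Hypothetical helper: returns a map function that stacks the named columns."""
    def pack(features, labels):
        # map() passes the two tuple components as separate arguments.
        features = dict(features)  # shallow copy so the incoming dict is not mutated
        numeric = [tf.cast(features.pop(name), tf.float32) for name in names]
        features["numerics"] = tf.stack(numeric, axis=1)
        return features, labels
    return pack


packed = toy_ds.map(make_packer(["age", "fare"]))

for batch, label in packed.take(1):
    print(sorted(batch.keys()))      # ['numerics', 'sex']
    print(batch["numerics"].shape)   # (2, 2)
```

The version with num_columns as a default argument works for the same reason: map only fills the first two positional parameters, so the default is left untouched.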
