How to normalize input data for models in TensorFlow

Question

My training data are saved in 3 files; each file is too large to fit into memory. Each training example is two-dimensional (2805 rows and 222 columns, where the 222nd column is the label) and contains numerical values. I would like to normalize the data before feeding it into the model for training. Below is my input pipeline code; the data has not been normalized before the dataset is created. Are there functions in TensorFlow that can do the normalization in my case?

import tensorflow as tf

# file1, file2, file3 are the paths to the three training files
dataset = tf.data.TextLineDataset([file1, file2, file3])
# combine 2805 lines into a single example
dataset = dataset.batch(2805)

def parse_example(line_batch):
    # 221 float feature columns, plus one integer column for the label
    record_defaults = [[1.0] for col in range(0, 221)]
    record_defaults.append([1])
    content = tf.decode_csv(line_batch, record_defaults=record_defaults, field_delim='\t')
    features = tf.stack(content[0:221])  # shape (221, 2805)
    features = tf.transpose(features)    # shape (2805, 221)
    label = content[-1][-1]              # one label per 2805-line example
    label = tf.one_hot(indices=tf.cast(label, tf.int32), depth=2)
    return features, label

dataset = dataset.map(parse_example)
dataset = dataset.shuffle(1000)
# batch multiple examples
dataset = dataset.batch(batch_size)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
data_batch, label_batch = iterator.get_next() 

Answer

There are different ways of "normalizing data". Depending on which one you have in mind, it may or may not be easy to implement in your case.

1. Fixed normalization

If you know the fixed range(s) of your values (e.g. feature #1 has values in [-5, 5], feature #2 has values in [0, 100], etc.), you could easily pre-process your feature tensor in parse_example(), e.g.:

def normalize_fixed(x, current_range, normed_range):
    # current_range/normed_range hold one [min, max] row per feature;
    # cast the Python lists to tensors so they can be sliced
    current_range = tf.cast(current_range, x.dtype)
    normed_range = tf.cast(normed_range, x.dtype)
    # shape (1, num_features), so the ranges broadcast over the feature
    # axis of x (here x has shape (2805, 221) after the transpose)
    current_min, current_max = tf.expand_dims(current_range[:, 0], 0), tf.expand_dims(current_range[:, 1], 0)
    normed_min, normed_max = tf.expand_dims(normed_range[:, 0], 0), tf.expand_dims(normed_range[:, 1], 0)
    # rescale from [current_min, current_max] to [normed_min, normed_max]
    x_normed = (x - current_min) / (current_max - current_min)
    x_normed = x_normed * (normed_max - normed_min) + normed_min
    return x_normed

def parse_example(line_batch,
                  fixed_range=[[-5, 5], [0, 100], ...],  # one [min, max] pair per feature
                  normed_range=[[0, 1]]):
    # ...
    features = tf.transpose(features)
    features = normalize_fixed(features, fixed_range, normed_range)
    # ...
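
As a quick sanity check, a toy run of normalize_fixed with two features (the values are arbitrary, for illustration only):

x = tf.constant([[-5.0,   0.0],
                 [ 0.0,  50.0],
                 [ 5.0, 100.0]])  # 3 rows, 2 feature columns
x_normed = normalize_fixed(x, [[-5, 5], [0, 100]], [[0, 1]])
# each feature column is mapped into [0, 1]:
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]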

2. Per-sample normalization

If your features are supposed to have approximately the same range of values, per-sample normalization could also be considered, i.e. applying normalization based on the feature moments (mean, variance) of each sample:

def normalize_with_moments(x, axes=[0, 1], epsilon=1e-8):
    mean, variance = tf.nn.moments(x, axes=axes)
    x_normed = (x - mean) / tf.sqrt(variance + epsilon) # epsilon to avoid dividing by zero
    return x_normed

def parse_example(line_batch):
    # ...
    features = tf.transpose(features)
    features = normalize_with_moments(features)
    # ...

3. Batch normalization

You could apply the same procedure over a complete batch instead of per-sample, which may make the process more stable:

data_batch = normalize_with_moments(data_batch, axes=[1, 2])

Similarly, you could use tf.nn.batch_normalization:
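
For reference, a minimal sketch of that variant (assuming the moments are taken over each sample as above, with keep_dims=True so they broadcast back over data_batch):

mean, variance = tf.nn.moments(data_batch, axes=[1, 2], keep_dims=True)
data_batch = tf.nn.batch_normalization(data_batch, mean, variance,
                                       offset=None, scale=None,
                                       variance_epsilon=1e-8)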

4. Dataset normalization

Normalizing using the mean/variance computed over the whole dataset would be the trickiest, since, as you mentioned, the dataset is large and split across several files.

tf.data.Dataset isn't really meant for such global computations. A solution would be to use whatever tools you have to pre-compute the dataset moments, then use this information in your TF pre-processing.
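
One way to sketch that offline pass, assuming each file can at least be streamed once (pandas is used here for chunked reading; the chunk size and the normalize_with_dataset_moments helper are illustrative, not part of the original pipeline):

import numpy as np
import pandas as pd

num_features = 221
count = 0
total = np.zeros(num_features)
total_sq = np.zeros(num_features)
for path in [file1, file2, file3]:
    # stream each file in chunks so it never has to fit in memory
    for chunk in pd.read_csv(path, sep='\t', header=None, chunksize=100000):
        values = chunk.values[:, :num_features]  # drop the label column
        count += values.shape[0]
        total += values.sum(axis=0)
        total_sq += (values ** 2).sum(axis=0)

mean = (total / count).astype(np.float32)
std = np.sqrt(total_sq / count - mean ** 2).astype(np.float32)

def normalize_with_dataset_moments(x, epsilon=1e-8):
    # call this in parse_example() instead of normalize_with_moments();
    # the pre-computed moments are baked into the graph as constants
    return (x - mean) / (std + epsilon)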

As mentioned by @MiniQuark, TensorFlow has a Transform library you could use to preprocess your data. Have a look at its Get Started guide, or for instance at the tft.scale_to_z_score() method for z-score normalization.
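
For a rough idea, a minimal preprocessing_fn sketch (the 'features' and 'label' keys are assumed names for how the data would be fed to tf.Transform, not from the original pipeline):

import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # tf.Transform computes the mean/variance over the whole dataset
    # in a first pass, then applies the scaling per example
    return {
        'features': tft.scale_to_z_score(inputs['features']),
        'label': inputs['label'],
    }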
