Data Normalization with tensorflow tf-transform


Problem Description


I'm doing neural network prediction on my own datasets using Tensorflow. The first thing I did was build a model that works with a small dataset on my computer. After that, I changed the code a little in order to use Google Cloud ML-Engine with bigger datasets, running both training and prediction in ML-Engine.

I am normalizing the features in the pandas dataframe, but this introduces skew and I get poor prediction results.

What I would really like is to use the library tf-transform to normalize the data in the graph. To do this, I would create a function preprocessing_fn and use tft.scale_to_0_1. https://github.com/tensorflow/transform/blob/master/getting_started.md

The main problem I found is when I try to do the prediction. I have searched the internet, but I can't find any example of an exported model where the data was normalized during training. In all the examples I found, the data is NOT normalized anywhere.

What I would like to know is: if I normalize the data during training and then send a new instance with new data for prediction, how does that data get normalized?

Maybe in the Tensorflow Data Pipeline? Are the variables used for normalization saved somewhere?

In summary: I'm looking for a way to normalize the inputs for my model so that new instances also end up normalized.

Solution

First of all, you don't really need tf.transform for this. All you need to do is to write a function that you call from both the training/eval input_fn and from your serving input_fn.

For example, assuming that you've used Pandas on your whole dataset to figure out the min and max:

```
def add_engineered(features):
  # Constants computed ahead of time over the full training dataset
  min_x = 22
  max_x = 43
  # Min-max scale 'x' into [0, 1]
  features['x'] = (features['x'] - min_x) / (max_x - min_x)
  return features
```
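The constants above come from an analysis pass over the training data. A minimal sketch of that pass in plain Python (the `x_values` list is made up; with pandas you would use `df['x'].min()` and `df['x'].max()`):

```python
# Hypothetical training values for the 'x' feature; in practice you'd
# load your full training set and take its min/max once, up front.
x_values = [22, 25, 31, 38, 43]

min_x = min(x_values)
max_x = max(x_values)
# These constants then get hard-coded into add_engineered
print(min_x, max_x)  # prints "22 43"
```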

Then, in your input_fn, wrap the features you return with a call to add_engineered:

```
def input_fn():
  features = ...
  label = ...
  return add_engineered(features), label
```

and in your serving_input_fn, make sure to similarly wrap the returned features (NOT the feature_placeholders) with a call to add_engineered:

```
def serving_input_fn():
  # tflearn is assumed to be tf.contrib.learn, as in the linked example
  feature_placeholders = ...
  features = ...
  return tflearn.utils.input_fn_utils.InputFnOps(
    add_engineered(features),
    None,
    feature_placeholders
  )
```

Now, your JSON input at prediction time would only need to contain the original, unscaled values.
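To illustrate, a small plain-Python sketch (the instance value is hypothetical) of what happens to a raw prediction input once the serving path applies add_engineered:

```python
import json

def add_engineered(features):
    # Same constants and scaling as in the training input_fn
    min_x = 22
    max_x = 43
    features['x'] = (features['x'] - min_x) / (max_x - min_x)
    return features

# The client sends the raw, unscaled value...
instance = json.loads('{"x": 30}')
# ...and the serving-side add_engineered call rescales it to [0, 1]
scaled = add_engineered(instance)
print(scaled['x'])  # (30 - 22) / (43 - 22) ≈ 0.381
```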

Here's a complete working example of this approach.

https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/courses/machine_learning/feateng/taxifare/trainer/model.py#L107

tf.transform provides for a two-phase process: an analysis step to compute the min and max, and a graph-modification step to insert the scaling into your TensorFlow graph. So, to use tf.transform, you first need to write a Dataflow pipeline that does the analysis and then plug calls to tft.scale_to_0_1 into your TensorFlow code. Here's an example of doing this:

https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
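This is not the real tf.transform API, but the two-phase idea can be sketched in plain Python (the dataset values are made up):

```python
# Phase 1: analysis - a full pass over the training data to compute statistics
def analyze(dataset):
    xs = [row['x'] for row in dataset]
    return {'min_x': min(xs), 'max_x': max(xs)}

# Phase 2: transform - apply those statistics to any instance,
# identically at training time and at serving time
def transform(row, stats):
    span = stats['max_x'] - stats['min_x']
    return {'x': (row['x'] - stats['min_x']) / span}

train = [{'x': 22}, {'x': 30}, {'x': 43}]
stats = analyze(train)  # tf.transform runs this phase as a Dataflow pipeline
scaled = [transform(r, stats) for r in train]
```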

The add_engineered() approach is simpler and is what I would suggest. The tf.transform approach is needed if your data distributions will shift over time, and so you want to automate the entire pipeline (e.g. for continuous training).
