Tensorflow Transform: How to find the mean of a variable over the entire dataset


Problem description

I often see text like this in many Tensorflow tutorials:

To do this calculation, you need the column means. You would obviously need to compute these in real life, but for this example we'll just provide them.

For small or medium sized CSV datasets, computing the mean is as easy as calling a pandas method on a dataframe or using a `scikit-learn` preprocessing utility.
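For instance, a minimal sketch with pandas (the file name ages.csv and the age column are hypothetical):

import pandas as pd

# The whole file fits in memory, so the column mean is a one-liner.
df = pd.read_csv('ages.csv')
mean_age = df['age'].mean()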

But if we have a large dataset, say a 50GB CSV file, how do you calculate the mean or other similar statistics? Tensorflow Transform claims that it can calculate global summary statistics, but the documentation doesn't really explain how this works or how to integrate it into a workflow.

Here is the code example from their getting started guide.

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = inputs['x']
  y = inputs['y']
  s = inputs['s']
  x_centered = x - tft.mean(x)
  y_normalized = tft.scale_to_0_1(y)
  s_integerized = tft.compute_and_apply_vocabulary(s)
  x_centered_times_y_normalized = x_centered * y_normalized
  return {
      'x_centered': x_centered,
      'y_normalized': y_normalized,
      'x_centered_times_y_normalized': x_centered_times_y_normalized,
      's_integerized': s_integerized
  }

The documentation says that this code will run tft.mean(x) over the entire dataset, but it is not clear how that can happen, since x appears to be scoped to a single batch. Yet here is the claim from the documentation:

While not obvious in the example above, the user defined preprocessing function is passed tensors representing batches and not individual instances, as happens during training and serving with TensorFlow. On the other hand, analyzers perform a computation over the entire dataset that returns a single value and not a batch of values. x is a Tensor with a shape of (batch_size,), while tft.mean(x) is a Tensor with a shape of ().

So my questions are:

  1. Does tft.mean() run over the entire dataset first, and only after computing the global mean does it begin to load batches?

  2. Are there any more detailed or complete examples of using tft.transforms in a workflow? For example, can these transforms be included in a single batch preprocessing function in a tf.data.Dataset.map() call, or how should they be integrated?

Suppose I am trying to write some code to calculate the average age of individuals in my tensorflow dataset. Here is the code I have so far. Is this the best way to do something like this, or is there a better way?

I used the tensorflow-2.0 make_csv_dataset(), which takes care of stacking the examples from the CSV file into a column structure. Note that I took the code for make_csv_dataset() from the new tutorial on the tensorflow website referenced in the link above.

import numpy as np
import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    file_path,
    batch_size=32,
    label_name=LABEL_COLUMN,
    na_value="?",
    num_epochs=1,
    ignore_errors=True)

list_of_batch_means = []

# In TF 2.x the dataset is directly iterable, so no one-shot iterator is needed.
for ex_features, ex_labels in dataset:
    # ex_features is a dict of column tensors, so take the batch length from
    # the 'age' column itself; len(ex_features) would count the columns.
    age = tf.cast(ex_features['age'], tf.float32)
    batch_length = tf.size(age)
    batch_sum = tf.reduce_sum(age)
    list_of_batch_means.append(batch_sum / tf.cast(batch_length, tf.float32))

average_age = np.mean(list_of_batch_means)

As a caveat, I divided batch_sum by the batch length manually instead of using tf.reduce_mean(), since the final batch will not necessarily be the same size as the other batches. This is probably a minor issue if you have a lot of batches, but I just wanted to be as accurate as possible.
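If exactness matters, a small variant that keeps a running sum and count sidesteps the unequal-batch-size issue entirely. Here is a sketch, assuming the same dataset built by make_csv_dataset above:

import tensorflow as tf

total_age = 0.0
total_count = 0

# Accumulate the global sum and count rather than averaging per-batch means,
# so a smaller final batch cannot skew the result.
for ex_features, ex_labels in dataset:
    age = tf.cast(ex_features['age'], tf.float32)
    total_age += float(tf.reduce_sum(age))
    total_count += int(tf.size(age))

average_age = total_age / total_count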

Any suggestions would be greatly appreciated.

Recommended answer

The most important concept in tf.transform is the preprocessing function. A preprocessing function is a logical description of a transformation of the dataset: it accepts and returns a dictionary of Tensors. Two kinds of functions (steps) are used to define a preprocessing function:

  1. Analyze step
  2. Transform step

Analyze step: It iterates over the whole dataset and creates a graph. For example, in order to calculate a mean, the full dataset is passed over to compute the average of a particular column (this step requires a full pass over the dataset).

Transform step: It uses the graph created in the analyze step to transform the complete dataset.

So the constants calculated in the analyze step are used in the transform step.
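To make this concrete, here is a minimal sketch of how the preprocessing_fn from the question can be run with the Beam implementation of tf.Transform, closely following the getting started guide. The tiny in-memory raw_data, its feature spec, and the temp directory are made up for illustration; only preprocessing_fn comes from the question.

import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Tiny in-memory dataset, a stand-in for a large CSV read with a Beam source.
raw_data = [
    {'x': 1.0, 'y': 1.0, 's': 'hello'},
    {'x': 2.0, 'y': 2.0, 's': 'world'},
    {'x': 3.0, 'y': 3.0, 's': 'hello'},
]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'x': tf.io.FixedLenFeature([], tf.float32),
        'y': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    # AnalyzeAndTransformDataset first makes the full analysis pass
    # (computing tft.mean(x), the vocabulary for s, etc.), then applies the
    # resulting constants batch by batch in the transform pass.
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset

In other words, the answer to question 1 is yes: the full-pass analyzers run first, and their results are then applied batch by batch during the transform phase.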

For a better understanding, you can go through this video followed by this presentation, which should solidify your understanding of how Tensorflow Transform works internally.

If you feel that the answer is helpful, please upvote it. Thanks!
