Tensorflow Transform: How to find the mean of a variable over the entire dataset


Problem description

I often see text like this in many Tensorflow tutorials:

To do this calculation, you need the column means. You would obviously need to compute these in real life, but for this example we'll just provide them.

For small or medium sized CSV datasets, computing the mean is as easy as calling a pandas method on a dataframe or using a `scikit-learn` preprocessing utility.
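For instance, a minimal sketch with pandas (the file name ages.csv and the age column are hypothetical):

import pandas as pd

# The whole file fits in memory, so the column mean is a one-liner.
df = pd.read_csv('ages.csv')
mean_age = df['age'].mean()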

But if we have a large dataset, say a 50GB CSV file, how do you calculate the mean or other similar statistics? Tensorflow Transform claims that it can calculate global summary statistics, but the documentation doesn't really explain how this works or how to integrate it into a workflow.

Here is the code example from their getting started guide.

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = inputs['x']
  y = inputs['y']
  s = inputs['s']
  x_centered = x - tft.mean(x)
  y_normalized = tft.scale_to_0_1(y)
  s_integerized = tft.compute_and_apply_vocabulary(s)
  x_centered_times_y_normalized = x_centered * y_normalized
  return {
      'x_centered': x_centered,
      'y_normalized': y_normalized,
      'x_centered_times_y_normalized': x_centered_times_y_normalized,
      's_integerized': s_integerized
  }

The documentation says that this code will run tft.mean(x) over the entire dataset, but it is not clear how that can happen, since x appears to be scoped to a single batch. Yet here is the claim from the documentation:

While not obvious in the example above, the user defined preprocessing function is passed tensors representing batches and not individual instances, as happens during training and serving with TensorFlow. On the other hand, analyzers perform a computation over the entire dataset that returns a single value and not a batch of values. x is a Tensor with a shape of (batch_size,), while tft.mean(x) is a Tensor with a shape of ().

So my questions are:

  1. Does tft.mean() run over the entire dataset first, and only after computing the global mean does it begin to load batches?

  2. Are there any more detailed or complete examples of using tft.transforms in a workflow? For example, can these transforms be included in a single batch preprocessing function in a tf.data.Dataset.map() call, or how should they be integrated?

Suppose I am trying to write some code to calculate the average age of individuals in my tensorflow dataset. Here is the code I have so far. Is this the best way to do something like this, or is there a better way?

I used the tensorflow-2.0 make_csv_dataset(), which takes care of stacking the examples from the CSV file into a column structure. Note that I took the code for make_csv_dataset() from the new tutorial on the tensorflow website referenced in the link above.

import numpy as np
import tensorflow as tf

dataset = tf.data.experimental.make_csv_dataset(
    file_path,
    batch_size=32,
    label_name=LABEL_COLUMN,
    na_value="?",
    num_epochs=1,
    ignore_errors=True)

list_of_batch_means = []

# In TF 2.x the dataset is directly iterable, so no one-shot iterator is needed.
for ex_features, ex_labels in dataset:
    # ex_features is a dict of column tensors, so take the batch length from
    # the 'age' column itself; len(ex_features) would count the columns.
    age = tf.cast(ex_features['age'], tf.float32)
    batch_length = tf.size(age)
    batch_sum = tf.reduce_sum(age)
    list_of_batch_means.append(batch_sum / tf.cast(batch_length, tf.float32))

average_age = np.mean(list_of_batch_means)

As a caveat, I divided batch_sum by the batch length manually instead of using tf.reduce_mean(), since the final batch will not necessarily be the same size as the other batches. This is probably a minor issue if you have a lot of batches, but I just wanted to be as accurate as possible.
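If exactness matters, a small variant that keeps a running sum and count sidesteps the unequal-batch-size issue entirely. Here is a sketch, assuming the same dataset built by make_csv_dataset above:

import tensorflow as tf

total_age = 0.0
total_count = 0

# Accumulate the global sum and count rather than averaging per-batch means,
# so a smaller final batch cannot skew the result.
for ex_features, ex_labels in dataset:
    age = tf.cast(ex_features['age'], tf.float32)
    total_age += float(tf.reduce_sum(age))
    total_count += int(tf.size(age))

average_age = total_age / total_count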

Any suggestions would be greatly appreciated.

Recommended answer

The most important concept in tf.transform is the preprocessing function. A preprocessing function is a logical description of a transformation of the dataset: it accepts and returns a dictionary of Tensors. Two kinds of functions (steps) are used to define a preprocessing function:

  1. Analyze step
  2. Transform step

Analyze step: It iterates over the whole dataset and creates a graph. For example, in order to calculate a mean, the full dataset is passed over to compute the average of a particular column (this step requires a full pass over the dataset).

Transform step: It uses the graph created in the analyze step to transform the complete dataset.

So the constants calculated in the analyze step are used in the transform step.
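To make this concrete, here is a minimal sketch of how the preprocessing_fn from the question can be run with the Beam implementation of tf.Transform, closely following the getting started guide. The tiny in-memory raw_data, its feature spec, and the temp directory are made up for illustration; only preprocessing_fn comes from the question.

import tempfile

import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Tiny in-memory dataset, a stand-in for a large CSV read with a Beam source.
raw_data = [
    {'x': 1.0, 'y': 1.0, 's': 'hello'},
    {'x': 2.0, 'y': 2.0, 's': 'world'},
    {'x': 3.0, 'y': 3.0, 's': 'hello'},
]

raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        'x': tf.io.FixedLenFeature([], tf.float32),
        'y': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string),
    }))

with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    # AnalyzeAndTransformDataset first makes the full analysis pass
    # (computing tft.mean(x), the vocabulary for s, etc.), then applies the
    # resulting constants batch by batch in the transform pass.
    transformed_dataset, transform_fn = (
        (raw_data, raw_data_metadata)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset

In other words, the answer to question 1 is yes: the full-pass analyzers run first, and their results are then applied batch by batch during the transform phase.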

For a better understanding, you can go through this video followed by this presentation, which should solidify your understanding of how Tensorflow Transform works internally.

If you feel that the answer is helpful, please upvote it. Thanks!
