Google Cloud DataFlow Randomize WritetoBigQuery

Views: 124 Published: 2018/5/7 17:26:43 Tags: google-bigquery google-cloud-platform google-cloud-dataflow

Problem description

I have successfully implemented a Dataflow pipeline that writes to BigQuery. This pipeline transforms data for a Cloud ML Engine job. However, I noticed that the rows it writes are ordered (or at least grouped) by the labels of my data. By this, I mean that they visibly appear to be organized in some way that is not completely random. Then, when I export the table to sharded .csv files in GCS, each shard is essentially ordered. This means that the data cannot be fed into TensorFlow randomly, since TF reads one .csv file at a time and the .csv files themselves are not random bags of rows.

Can anybody explain why the BigQuery table written by the Apache Beam pipeline would appear to be non-random if the original input data was randomized? Is there any way to force a shuffle/randomization of rows before writing to BigQuery? I need to ensure that the training data is completely random before it is loaded into the ML model.
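
To make the requested behaviour concrete, here is a minimal sketch of the random-key idea in plain Python: tag every row with a random number, then order by that number. It is not Beam or BigQuery code and is not taken from the pipeline above; `shuffle_rows` and the sample rows are hypothetical names used only for illustration. The same idea could presumably be applied inside a pipeline by pairing each element with a random key, or at export time by ordering on a random value.

```python
import random


def shuffle_rows(rows, seed=None):
    """Return the rows in random order by tagging each with a random key.

    Hypothetical helper for illustration only; it is not part of Apache Beam
    or the BigQuery client library.
    """
    rng = random.Random(seed)
    # Pair every row with an independent random key.
    keyed = [(rng.random(), row) for row in rows]
    # Ordering by the random key breaks any grouping present in the input.
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed]


# Hypothetical rows grouped by label, mimicking the ordering seen in the table.
grouped = [
    {"label": label, "feature": i}
    for label in ("a", "b", "c")
    for i in range(4)
]

shuffled = shuffle_rows(grouped, seed=0)
for row in shuffled:
    print(row["label"], row["feature"])
```

The sketch keeps every row exactly once and changes only the order. Passing a fixed `seed` makes the order reproducible, which can help when comparing training runs.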
