Google Cloud DataFlow Randomize WritetoBigQuery

Views: 29 Published: 2021/12/30 23:16:38 Tags: google-bigquery google-cloud-platform google-cloud-dataflow

Problem description

I have successfully implemented a Dataflow pipeline that writes to BigQuery. This pipeline transforms data for a Cloud ML Engine job. However, I noticed that the rows it writes are ordered (or at least grouped) by the labels of my data: they visibly appear to be organized in some way rather than being completely random. When I then export the table to sharded .csv files in GCS, each shard is essentially ordered as well. This means the data cannot be fed into TensorFlow randomly, since TF reads one .csv at a time and the .csv files themselves are not random bags of rows.

Can anybody explain why the BigQuery table written by the Apache Beam pipeline would appear non-random if the original input data was randomized? Is there any way to force a shuffle/randomization of rows before writing to BigQuery? I need to ensure that the training data is completely random before it is loaded into the ML model.
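
To make the goal concrete, here is a minimal plain-Python sketch of the property being asked for: rows that start out grouped by label end up spread across shards in random order. It is not Beam or BigQuery code, and the helper name `shuffle_into_shards`, the `id` and `label` fields, and the shard count are all hypothetical, chosen only for illustration. The idea (attach a random key to each row, order by that key, then deal rows out to shards) is one common way to randomize, not necessarily what a Dataflow pipeline would do internally.

```python
import random
from collections import Counter


def shuffle_into_shards(rows, num_shards, seed=None):
    """Return num_shards lists that together hold all rows in random order."""
    rng = random.Random(seed)
    # Pair each row with a random key; the index breaks ties so that
    # the row dictionaries themselves are never compared during sorting.
    keyed = [(rng.random(), i, row) for i, row in enumerate(rows)]
    keyed.sort()
    shuffled = [row for _, _, row in keyed]
    # Deal the shuffled rows round-robin so every shard draws from the
    # whole dataset instead of one contiguous, label-grouped slice.
    return [shuffled[i::num_shards] for i in range(num_shards)]


# Hypothetical input: 300 rows grouped by label, like the exported table.
rows = [{"id": i, "label": i // 100} for i in range(300)]
shards = shuffle_into_shards(rows, num_shards=3, seed=0)

for shard in shards:
    # Each shard should now contain a mix of labels.
    print(len(shard), Counter(r["label"] for r in shard))
```

Whether the same effect is best achieved inside the pipeline, at export time, or in the TensorFlow input reader depends on the setup; the sketch only shows what "random bags of rows" per shard would look like.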
