Google Cloud DataFlow Randomize WritetoBigQuery

Problem description

I have successfully implemented a Dataflow pipeline that writes to BigQuery. This pipeline transforms data for a Cloud ML Engine job. However, I noticed that the rows that have been written are ordered (or at least grouped) by the labels of my data. By this, I mean that they visually appear to be organized in some way (that is not completely random). Then, when I export the table to sharded .csv files in GCS, each sharded .csv is essentially ordered. This means that the data cannot be fed into TensorFlow randomly, since TF grabs one .csv at a time and the .csv files themselves are not random bags of rows.

Can anybody explain why the BigQuery table written by the Apache Beam pipeline would appear to be non-random if the original input data was randomized? Is there any way to force a shuffle/randomization of rows before writing to BigQuery? I need to ensure that the training data is completely random before it is loaded into the ML model.

Recommended answer

BigQuery tables don't have the concept of order or grouping; they are just a bag of rows. If you need ordering or grouping, write a query with an ORDER BY or GROUP BY clause. If you have code that reads rows from BigQuery and requires these rows to be read in random order, you can do something like the approach described at https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning
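
As a rough illustration of that idea, here is a minimal sketch that builds a BigQuery Standard SQL query which selects a repeatable subset of rows by hashing a key column and returns them in random order. The function name `build_shuffled_split_query`, the table `my_project.my_dataset.training_rows`, and the column `row_id` are hypothetical placeholders, not anything from the original question. `ORDER BY RAND()` is assumed to be acceptable for the table size; on very large tables a global sort may be slow or exceed resource limits.

```python
def build_shuffled_split_query(table, key_column, train_percent=80):
    """Build a Standard SQL query for a repeatable, shuffled training split.

    table, key_column and train_percent are illustrative parameters.
    Hashing key_column with FARM_FINGERPRINT means the same rows land in the
    split on every run; ORDER BY RAND() then randomizes the row order of the
    result (the order will differ between runs).
    """
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 1 and 99")
    return (
        "SELECT *\n"
        f"FROM `{table}`\n"
        f"WHERE MOD(ABS(FARM_FINGERPRINT(CAST({key_column} AS STRING))), 100)"
        f" < {train_percent}\n"
        "ORDER BY RAND()"
    )


if __name__ == "__main__":
    # Hypothetical table and key column, for illustration only.
    print(build_shuffled_split_query(
        "my_project.my_dataset.training_rows", "row_id", 80))
```

The query result could then be saved to a destination table and exported to sharded .csv files. This assumes the export roughly preserves the randomized order within each shard, which BigQuery does not guarantee, so shuffling again on the TensorFlow input side is a reasonable safeguard.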
