Google Cloud DataFlow Randomize WritetoBigQuery

Question

I have successfully implemented a Dataflow pipeline that writes to BigQuery. This pipeline transforms data for a Cloud ML Engine job. However, I noticed that the rows that have been written are ordered (or at least grouped) by the labels of my data. By this, I mean that they visually appear to be organized in some way (that is, not completely random). Then, when I export the table to sharded .csv files in GCS, each shard is essentially ordered. This means that the data cannot be fed into TensorFlow randomly, since TF grabs one .csv file at a time and the .csv files themselves are not random bags of rows.

Can anybody explain why the BigQuery table written by the Apache Beam pipeline would appear to be non-random if the original input data was randomized? Is there any way to force a shuffle/randomization of rows before writing to BigQuery? I need to ensure that the training data is completely random before it is loaded into the ML model.

Recommended answer

BigQuery tables have no concept of order or grouping; they are just a bag of rows. If you need ordering or grouping, write a query with an ORDER BY or GROUP BY clause. If you have code that reads rows from BigQuery and requires these rows to be read in random order, you can do something like what is described at https://www.oreilly.com/learning/repeatable-sampling-of-data-sets-in-bigquery-for-machine-learning
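To make the "write a query" suggestion concrete, here is a minimal sketch in Python that only builds the query strings; it does not call BigQuery. The table name `my_project.my_dataset.training_rows`, the column `row_id`, and both helper function names are hypothetical and not from the original post. `RAND()` and `FARM_FINGERPRINT()` are BigQuery standard SQL functions. The hash-based filter is in the spirit of the repeatable sampling approach in the linked article, not a copy of it.

```python
def random_order_query(table: str) -> str:
    """Build a query that returns every row of `table` in random order.

    Note: ORDER BY RAND() gives a different order on each run and may be
    costly on very large tables, since it sorts the whole result.
    """
    return f"SELECT * FROM `{table}` ORDER BY RAND()"


def repeatable_sample_query(
    table: str, hash_column: str, buckets: int = 10, keep_buckets: int = 8
) -> str:
    """Build a query that keeps a repeatable subset of rows.

    Rows are assigned to one of `buckets` groups by hashing `hash_column`
    (a hypothetical column); the same rows are selected on every run.
    """
    if not 0 < keep_buckets <= buckets:
        raise ValueError("keep_buckets must be between 1 and buckets")
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE MOD(ABS(FARM_FINGERPRINT(CAST({hash_column} AS STRING))), "
        f"{buckets}) < {keep_buckets}"
    )


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    table = "my_project.my_dataset.training_rows"
    print(random_order_query(table))
    print(repeatable_sample_query(table, "row_id"))
```

Running either query and exporting its result, rather than exporting the table directly, is one way to get shards that are not grouped by label. Whether that is enough for a given training setup is a judgment call; shuffling again at read time in the training input pipeline is a common complement.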
