Create Large CSV data using Google Cloud Dataflow


Question

I need to create a large CSV file with a header and roughly 2 billion records. Creating it with a standalone script takes a long time, but since the records are unrelated to each other, I understand Cloud Dataflow can distribute the work across multiple worker GCE machines of my choice. Does Cloud Dataflow always need an input? Here I am trying to programmatically generate data in the following format:

ItemId,   ItemQuantity, ItemPrice, SaleValue, SaleDate
item0001, 25,           100,       2500,      2017-03-18
item0002, 50,           200,       10000,     2017-03-25

Note

ItemId can be postfixed with any random number between 0001 and 9999
ItemQuantity can be a random value between 1 and 1000
ItemPrice can be a random value between 1 and 100
SaleValue = ItemQuantity * ItemPrice
SaleDate is between 2015-01-01 and 2017-12-31

Any language will do.

Continued from the question Generate a large file using Google Cloud Dataflow.

Answer

Currently, there is not a very elegant way of doing this. In Python you would do something like the following (the same approach works in Java; only the syntax changes):

import random

import apache_beam as beam

def generate_keys(unused_element):
  # Generate 2000 key-value pairs to shuffle across workers
  for i in range(2000):
    yield (i, 0)

def generate_random_elements(unused_key_and_values):
  # Emit 1,000,000 rows per key (2000 keys x 1,000,000 rows = 2 billion records)
  for i in range(1000000):
    yield random_element()  # random_element() builds one CSV line (see below)

p = beam.Pipeline(my_options)  # my_options: your PipelineOptions (see below)
(p
 | beam.Create(['any'])            # a single dummy element to start from
 | beam.FlatMap(generate_keys)     # fan out to 2000 keys
 | beam.GroupByKey()               # shuffle the keys onto different workers
 | beam.FlatMap(generate_random_elements)
 | beam.io.WriteToText('gs://bucket-name/file-prefix'))
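
random_element() above is left undefined in the answer. A minimal sketch of one way it could build a row matching the format in the question is shown below; the helper name is just the placeholder used in the pipeline, and the field ranges come from the question itself:

import datetime
import random

def random_element():
  # Build one CSV line: ItemId,ItemQuantity,ItemPrice,SaleValue,SaleDate
  quantity = random.randint(1, 1000)
  price = random.randint(1, 100)
  start = datetime.date(2015, 1, 1)
  end = datetime.date(2017, 12, 31)
  sale_date = start + datetime.timedelta(days=random.randint(0, (end - start).days))
  return 'item%04d,%d,%d,%d,%s' % (
      random.randint(1, 9999), quantity, price, quantity * price, sale_date.isoformat())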

In generate_keys() we generate 2000 different keys and then run GroupByKey so that they get shuffled to different workers. We need to do this because a DoFn currently cannot be split across several workers. (Once SplittableDoFn is implemented, this will be much easier.)

As a note, when Dataflow writes results out to sinks, it commonly separates them into different files (e.g. gs://bucket-name/file-prefix-0000-00001, and so on), so you'll need to condense the files together.
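
If you also want the header row from the question and a .csv suffix, newer Beam versions let you pass these to WriteToText; a sketch, assuming your Beam release supports the header argument (the header string below is taken from the question):

# Replace the last step of the pipeline above with something like:
 | beam.io.WriteToText('gs://bucket-name/file-prefix',
                       file_name_suffix='.csv',
                       header='ItemId,ItemQuantity,ItemPrice,SaleValue,SaleDate')

Note that each output shard gets its own header line, so if you condense the shards into one file you will need to drop the duplicate headers.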

Also, you can pass --num_workers 10 (or however many workers you want Dataflow to spawn), or use autoscaling.
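
For completeness, here is one way my_options from the pipeline above might be constructed; this is only a sketch, and the runner, project, bucket, and worker count are placeholder values to replace with your own:

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; substitute your own project, bucket, and worker count.
my_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--temp_location=gs://bucket-name/temp',
    '--num_workers=10',
])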
