How to get a list of elements out of a PCollection in Google Dataflow and use it in the pipeline to loop Write Transforms?


Question





I am using Google Cloud Dataflow with the Python SDK.

I would like to :

  • Get a list of unique dates out of a master PCollection
  • Loop through the dates in that list to create filtered PCollections (each with a unique date), and write each filtered PCollection to its partition in a time-partitioned table in BigQuery.

How can I get that list? After the following combine transform, I created a ListPCollectionView object but I cannot iterate that object:

import apache_beam as beam


class ToUniqueList(beam.CombineFn):

    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        if element not in accumulator:
            accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        # Flatten the per-worker lists before deduplicating; calling
        # set() on the lists themselves would fail (lists are unhashable).
        return list(set(e for acc in accumulators for e in acc))

    def extract_output(self, accumulator):
        return accumulator


def get_list_of_dates(pcoll):
    return (pcoll
            | 'get the list of dates' >> beam.CombineGlobally(ToUniqueList()))

Am I doing it all wrong? What is the best way to do that?

Thanks.

Solution

It is not possible to get the contents of a PCollection directly - an Apache Beam or Dataflow pipeline is more like a query plan of what processing should be done, with PCollection being a logical intermediate node in the plan, rather than containing the data. The main program assembles the plan (pipeline) and kicks it off.

However, ultimately you're trying to write data to BigQuery tables sharded by date. This use case is currently supported only in the Java SDK and only for streaming pipelines.

For a more general treatment of writing data to multiple destinations depending on the data, follow BEAM-92.
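For completeness: later Beam Python SDK releases added dynamic destinations, where WriteToBigQuery accepts a callable that computes the target table per element, including BigQuery's `$YYYYMMDD` partition decorator. A hedged sketch (the project, dataset, and field names are hypothetical):

```python
def table_for(element):
    # Route each record to the matching partition of a day-partitioned
    # table; 'myproject:mydataset.events' is an illustrative name.
    return 'myproject:mydataset.events$' + element['date'].replace('-', '')

# Inside a pipeline, the write step would then be roughly:
# records | beam.io.WriteToBigQuery(
#     table=table_for,
#     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
```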

See also Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow
