PCollection to Array - How to dynamically input a header into a WriteToText PTransform?

Question

I am writing a Dataflow job with Apache Beam 2.19, running primarily on the Dataflow runner. I am trying to transform a BigQuery input with nested and repeated fields into a flattened CSV. The BigQuery input is flattened using a recursive method. Writing the flattened rows to a CSV file is not a problem, except that I need to pass the dictionary keys as the header. I can transform the headers into a PValue singleton, but I am unable to pass this as an input to the header parameter (accepts an array). https://beam.apache.org/releases/pydoc/2.19.0/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText

Recommended answer

Unfortunately, this is currently unsupported. File headers can only be specified at pipeline construction time, so the best solution at the moment is to generate the header you need at pipeline construction time instead of at execution time.
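One way to produce the header at construction time is to derive it from the table schema, which is known before the pipeline runs. The sketch below is only an illustration under stated assumptions: the schema is given as a list of dicts shaped like BigQuery's JSON schema, nested names are joined with "_", and flatten_field_names, csv_line, row_to_csv and write_csv are hypothetical helpers, not part of Beam. It does not cover repeated fields whose column count depends on the data. Note that the linked documentation describes the header parameter of WriteToText as a single string.

```python
import csv
import io


def flatten_field_names(fields, prefix=""):
    """Return flattened column names for a nested schema description.

    `fields` is assumed to be a list of dicts shaped like BigQuery's JSON
    schema: {"name": ..., "type": ..., "fields": [...]} for RECORD types.
    Joining nested names with "_" is an assumed convention.
    """
    names = []
    for field in fields:
        full_name = prefix + field["name"]
        if field.get("type") == "RECORD":
            names.extend(
                flatten_field_names(field.get("fields", []), full_name + "_"))
        else:
            names.append(full_name)
    return names


def csv_line(values):
    """Format one list of values as a CSV line without a trailing newline."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(values)
    return buffer.getvalue()


def row_to_csv(row, columns):
    """Format a flattened dict as a CSV line, using `columns` for ordering."""
    return csv_line([row.get(column, "") for column in columns])


def write_csv(flat_rows, schema_fields, output_prefix):
    """Attach CSV formatting and a WriteToText step to `flat_rows`.

    `flat_rows` is assumed to be a PCollection of already flattened dicts.
    """
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam

    # The header is computed here, while the pipeline is being constructed.
    columns = flatten_field_names(schema_fields)
    return (
        flat_rows
        | "ToCsvLine" >> beam.Map(row_to_csv, columns)
        | "WriteCsv" >> beam.io.WriteToText(
            output_prefix,
            file_name_suffix=".csv",
            header=csv_line(columns))
    )
```

Because the same columns list drives both the header and the row formatting, the columns in every written line stay aligned with the header.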

That said, you may be able to "cheat" your way to the same result. For example, you could write a CombineFn that combines all the elements bound for TextIO into a single string containing the CSV body. Then send that string to a ParDo that takes the dictionary keys as a side input and prepends them to the CSV body as a header, and finally send the resulting string, which represents your whole file, to your TextIO transform.
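A rough sketch of this workaround follows. It is not a tested Dataflow job, and it makes several assumptions: join_lines, prepend_header and write_single_file_csv are hypothetical names, a plain callable passed to CombineGlobally stands in for a hand-written CombineFn subclass, the header is a singleton PCollection holding a list of column names, and those names need no CSV quoting. Row order is not guaranteed after the combine, and the whole file is held in memory as one element.

```python
def join_lines(lines):
    """Join CSV lines (or already joined partial bodies) with newlines.

    Beam may call this on raw lines and again on partial results, so the
    operation has to be associative. Empty strings are skipped so that empty
    partial results do not introduce blank lines.
    """
    return "\n".join(line for line in lines if line)


def prepend_header(body, columns):
    """Put the header line in front of the combined CSV body."""
    header = ",".join(columns)  # assumes names need no CSV quoting
    return header + "\n" + body if body else header


def write_single_file_csv(csv_lines, header_columns, output_prefix):
    """Combine all lines, prepend a side-input header, and write one file.

    `csv_lines` is assumed to be a PCollection of formatted CSV lines and
    `header_columns` a PCollection holding exactly one list of column names.
    """
    # Imported here so the helpers above can be used without Beam installed.
    import apache_beam as beam

    return (
        csv_lines
        | "CombineBody" >> beam.CombineGlobally(join_lines)
        | "PrependHeader" >> beam.Map(
            prepend_header,
            columns=beam.pvalue.AsSingleton(header_columns))
        # No `header` argument here: the header is already part of the element.
        | "WriteFile" >> beam.io.WriteToText(
            output_prefix,
            file_name_suffix=".csv",
            num_shards=1,
            shard_name_template="")
    )
```

The single combined element is why this approach scales poorly: one worker has to hold and write the entire file.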

To reiterate, this is a bit of a cheat to get around the lack of support, and it is probably more brittle and less performant than a natively supported dynamic header would be. If you can avoid the issue by generating the header at pipeline construction time instead, that is far better.
