PCollection to Array - How to dynamically input a header into a WriteToText PTransform?

Problem description

I am writing a Dataflow job using Apache Beam 2.19, running primarily on the Dataflow runner. I am attempting to transform a BigQuery input with nested and repeated fields into a flattened CSV. The BQ input is flattened using a recursive method. Writing the flattened rows to a CSV file is not a problem, except that I need to pass the dictionary keys as a header. I can transform the headers into a pvalue singleton, but I am unable to pass it as input to the header parameter (accepts an array). https://beam.apache.org/releases/pydoc/2.19.0/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText

Recommended answer

Unfortunately this is currently unsupported. File headers can only be specified at pipeline construction time, so the best solution at the moment is to try to generate the header you need at pipeline construction time instead of execution time.
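As a rough illustration of the construction-time approach, the sketch below derives the column names from a BigQuery-style JSON schema before the pipeline runs and passes them to `WriteToText` as a single header string (the linked documentation describes `header` as a string). The helper names `flatten_field_names`, `to_csv_line` and `write_csv`, and the `_` separator for nested names, are assumptions made for this example; the separator in particular would have to match whatever the recursive flattening produces.

```python
import csv
import io


def flatten_field_names(fields, prefix=""):
    """Return flattened column names for a BigQuery-style JSON schema.

    `fields` is a list of dicts with "name", "type" and, for RECORD
    fields, a nested "fields" list. The "_" separator is an assumption.
    """
    names = []
    for field in fields:
        full_name = prefix + field["name"]
        if field.get("type") in ("RECORD", "STRUCT"):
            # Recurse into nested records, prefixing child names.
            names.extend(
                flatten_field_names(field.get("fields", []), full_name + "_"))
        else:
            names.append(full_name)
    return names


def to_csv_line(values):
    """Format one row as a CSV line (with quoting, without a newline)."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow(values)
    return buf.getvalue()


def write_csv(rows, column_names, output_prefix):
    """Write a PCollection of flat dicts as CSV with a fixed header.

    `column_names` is an ordinary Python list, known before the
    pipeline runs, so it can be passed straight to WriteToText.
    """
    # Imported here so the helpers above can be used without Beam.
    import apache_beam as beam

    return (
        rows
        | "ToCsvLine" >> beam.Map(
            lambda row: to_csv_line(
                [row.get(name, "") for name in column_names]))
        | "WriteCsv" >> beam.io.WriteToText(
            output_prefix,
            file_name_suffix=".csv",
            header=to_csv_line(column_names))
    )
```

The schema itself would have to be available while the pipeline is being built, for example by loading it from a file or fetching it from BigQuery before constructing the pipeline; how that is done is outside this sketch.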

That said, you may be able to "cheat" this in a way to get the same result. For example, you could write a CombineFn that combines all the elements destined for TextIO into a single string containing the CSV body. Then send that string to a ParDo that takes the dictionary keys as a side input and prepends them to the CSV body as a header, and finally send the resulting string, which represents your whole file, to your TextIO transform.
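A minimal sketch of that workaround might look like the following. The names `prepend_header`, `write_csv_with_dynamic_header` and `JoinLinesFn` are hypothetical, and the sketch assumes `lines` is a PCollection of already formatted CSV lines, `header_keys` is a PCollection holding exactly one list of column names, and the column names contain no commas or quotes.

```python
def prepend_header(body, keys):
    """Put a header row built from `keys` in front of the CSV body."""
    header = ",".join(keys)  # assumes keys need no CSV quoting
    return header + "\n" + body if body else header


def write_csv_with_dynamic_header(lines, header_keys, output_prefix):
    """Write all CSV lines plus a runtime-computed header to one file."""
    # Imported here so prepend_header can be tested without Beam.
    import apache_beam as beam

    class JoinLinesFn(beam.CombineFn):
        """Collects every line, then joins them into one string."""

        def create_accumulator(self):
            return []

        def add_input(self, accumulator, line):
            accumulator.append(line)
            return accumulator

        def merge_accumulators(self, accumulators):
            merged = []
            for accumulator in accumulators:
                merged.extend(accumulator)
            return merged

        def extract_output(self, accumulator):
            return "\n".join(accumulator)

    return (
        lines
        | "JoinBody" >> beam.CombineGlobally(JoinLinesFn())
        | "PrependHeader" >> beam.Map(
            prepend_header,
            # Side input: the single list of dictionary keys.
            keys=beam.pvalue.AsSingleton(header_keys))
        | "WriteFile" >> beam.io.WriteToText(
            output_prefix,
            file_name_suffix=".csv",
            num_shards=1,
            shard_name_template="")
    )
```

Two consequences of this design are worth keeping in mind: a CombineFn gives no ordering guarantee, so the rows may come out in any order, and the entire file body has to fit in the memory of a single worker.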

To reiterate, that's a bit of a cheat to get around the lack of support, and it's probably more brittle and less performant than a natively supported dynamic header would be. If you are able to avoid the issue by generating the header at pipeline construction time instead, that's far better.
