Apache Beam TextIO.ReadAll: how to emit KeyValue pairs instead of a PCollection of Strings

Problem description

The pipeline starts by reading from PubsubIO. Each Pub/Sub message is a GCS file path. I know that I can use ReadAll() to emit the lines from each path. However, that doesn't serve my purpose, because the information about the file path is lost. What I need to emit is a KV<'Filepath','Lines inside files'>.

The Pub/Sub messages will look like this:

Message1 -> gs://folder1/Topic1/topicfile1.gz
Message2 -> gs://folder1/Topic2/topicfile2.gz

Assume the file contents look like this:

topicfile1.gz
{
topic1.line1
topic1.line2
}

topicfile2.gz
{
topic2.line1
topic2.line2
}

What I am expecting is a PCollection like the one below:

{KV<'gs://folder1/Topic1/topicfile1.gz','topic1.line1'>}
{KV<'gs://folder1/Topic1/topicfile1.gz','topic1.line2'>}
{KV<'gs://folder1/Topic2/topicfile2.gz','topic2.line1'>}
{KV<'gs://folder1/Topic2/topicfile2.gz','topic2.line2'>}

I couldn't find a way to read a file from a path inside the ParDo function so that I can map the path to the lines.

I hope this is clear.

Recommended answer

If I understood the question correctly, I don't think this is supported in TextIO out of the box.
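Since TextIO does not carry the path through, one option is to do the per-file reading yourself and attach the path to every line. Below is a minimal sketch of that per-file step in plain Java, with no Beam dependency. The class name PathLinePairs and the method linesWithPath are hypothetical, and the sketch assumes the files are gzip-compressed UTF-8 text as in the question. In a Beam pipeline this logic would sit inside a DoFn; Beam's FileIO.readMatches() yields ReadableFile elements that expose both the file metadata and a channel to read from, but that wiring is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPInputStream;

/** Hypothetical helper: pairs every line of a gzipped text file with the file's path. */
final class PathLinePairs {

    private PathLinePairs() {}

    /**
     * Reads gzip-compressed UTF-8 text from {@code compressed} and returns one
     * (path, line) entry per line, in file order. The path is only used as the
     * key; the caller is responsible for opening the stream for that path.
     */
    static List<Map.Entry<String, String>> linesWithPath(String path, InputStream compressed)
            throws IOException {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Keep the originating path next to each line so it is not lost downstream.
                pairs.add(new SimpleImmutableEntry<>(path, line));
            }
        }
        return pairs;
    }
}
```

Inside a DoFn, each returned entry would be emitted as KV.of(path, line) rather than collected into a list. The list is used here only to keep the sketch independent of Beam.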

Details

When you apply a transform like readAll(), there are a couple of steps between getting the initial file paths from the IO and finally emitting all the lines from all the files.

For example, the logic in TextIO:
