Apache Beam TextIO.ReadAll如何发出KeyValue而不是Pcollection的字符串 [英] Apache Beam TextIO.ReadAll How to emit KeyValue instead of String of Pcollection
问题描述
管道从PUBSUBIo读取开始.PubSub IO内的消息是GCS文件路径.我知道我可以使用 ReadAll()
从每个路径发出行.但是,它不符合我的目的(有关文件路径的信息丢失了).我需要发出的是 KV<'Filepath','files内部的行'>
.
Pipeline Starts by Reading from PUBSUBIo. The message inside PubSub IO is a GCS file path. I know that I can use ReadAll()
to emit the lines from each path. However, it doesn't serve my purpose(Information regarding the file path is lost). What I need is to emit is a KV<'Filepath','Lines inside files'>
.
PubSUB消息看起来像
PubSUB messages will look like
Message1 -> gs://folder1/Topic1/topicfile1.gz
Message2 -> gs://folder1/Topic2/topicfile2.gz
假设文件内容如下所示
topicfile1.gz
{
topic1.line1
topic1.line2
}
topicfile2.gz
{
topic2.line1
topic2.line2
}
我期望的是像下面这样的收藏集
What I am expecting is a pcollection like the one below
{KV<'gs://folder1/Topic1/topicfile1.gz','topic1.line1'>}
{KV<'gs://folder1/Topic1/topicfile1.gz','topic1.line2'>}
{KV<'gs://folder1/Topic2/topicfile2.gz','topic2.line1'>}
{KV<'gs://folder1/Topic2/topicfile2.gz','topic2.line2'>}
我找不到从 ParDo
函数内部的路径读取文件以将路径映射到行的方法.
I could't find a way to read a file from a path inside the ParDo
function to map the path to the lines.
希望这很清楚.
推荐答案
如果我正确理解了这个问题,我认为在 TextIO
中不支持此功能.
I don't think this is supported in TextIO
out of the box if I understood the question correctly.
详细信息
当您应用像 readAll()
这样的转换时,在从IO获取初始文件路径和最后从所有文件发出所有行之间,涉及两个步骤.
When you apply transforms like readAll()
there are a couple of steps involved between getting the initial file paths from the IO and emitting all the lines from all the files in the end.
For example, the logic in TextIO
:
- 它接受文件路径(或路径模式)的
PCollection
; - 它应用
FileIO.matchAll()
将路径模式的PCollection
转换为MatchResult.Metadata
PCollection >描述这些路径的对象; - 然后应用
FileIO.readMatches()
将元数据对象转换为描述特定文件的ReadableFile
对象; - 最后,它应用
TextIO.readFiles()
,该代码接受一个ReadableFile
并输出该文件中的所有字符串;- 在最后一步,您想向输出添加文件路径,以便知道哪个字符串来自哪个文件.如果可以选择更改最后一步以发出
KV< ReadableFile,String>
而不是仅字符串,那将有什么帮助,以便您可以使用ReadableFile.metadata 访问文件路径代码>.
- it accepts a
PCollection
of file paths (or path patterns); - it applies
FileIO.matchAll()
that converts thePCollection
of path patterns intoPCollection
ofMatchResult.Metadata
objects that describe those paths; - then it applies the
FileIO.readMatches()
that converts the metadata objects intoReadableFile
objects that describe specific files; - and lastly it applies
TextIO.readFiles()
that takes in aReadableFile
and outputs all the strings from that file;- at this last step you would want to add a file path to the output, so that you know which string comes from which file. What would help if there was an option to change the last step to emit
KV<ReadableFile, String>
instead of just strings, so that you could access the file path usingReadableFile.metadata
.
环顾四周的代码似乎是从文件中散发出原始行,这是目前使用
TextIO
的唯一受支持的方式.Looking around that code it seems that emitting the raw lines from the files is the only supported way of doing things using
TextIO
right now.解决方法
可能最直接的方法是编写类似于
TextIO.ReadAll
的自己的PTransform
.可以这样工作:Probably the most straightforward way is to write your own
PTransform
similar toTextIO.ReadAll
. This would work something like this:高级:
- 创建和自定义您自己的版本
TextIO.ReadAll
; - 以及
ReadAllViaFileBasedSource
; - 更改您的
ReadAllViaFileBasedSource
版本以发出您想要的内容; - 使用此自定义版本的
TextIO.ReadAll
,该版本使用您的自定义版本ReadAllViaFileBasedSource
发出正确的内容;
- Create and customize your own version
TextIO.ReadAll
; - And of
ReadAllViaFileBasedSource
; - Change your version of
ReadAllViaFileBasedSource
to emit what you want; - Use this custom version of
TextIO.ReadAll
that uses your custom versionReadAllViaFileBasedSource
that emits the correct things;
稍微详细一点:
- 只需复制整个实现的我上面提到的步骤;
- 但在
expand()
中,readFiles()
现在是ReadAllViaFileBasedSource
似乎是
- just copy the whole
TextIO.ReadAll
, it's a pretty short wrapper forFileIO
that implements the steps I mentioned above; - but in the
expand()
, at the last step, instead ofreadFiles()
you would apply a custom logic that will emit your desiredKVs
:readFiles()
right now is implemented byReadAllViaFileBasedSource
;ReadAllViaFileBasedSource
seems to be the actual thing that convertsReadableFiles
into strings;- you would create a copy of
ReadAllViaFileBasedSource
and change the output logic, so that instead of just emitting the file it also emits the metadata;
这篇关于Apache Beam TextIO.ReadAll如何发出KeyValue而不是Pcollection的字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
- at this last step you would want to add a file path to the output, so that you know which string comes from which file. What would help if there was an option to change the last step to emit
- 在最后一步,您想向输出添加文件路径,以便知道哪个字符串来自哪个文件.如果可以选择更改最后一步以发出