如何在 Apache Beam 中实现像 Spark 一样的 zipWithIndex? [英] How can I implement zipWithIndex like Spark in Apache Beam?
问题描述
Pcollection<String> p1 = {"a","b","c"}
PCollection< KV<Integer,String> > p2 = p1.apply("some operation ")
//{(1,"a"),(2,"b"),(3,"c")}
我需要让它可扩展到像 Apache Spark 这样的大文件,这样它就可以像这样工作:
I need to make it scalable for large file like Apache Spark such that it works like:
sc.textFile("./filename").zipWithIndex
我的目标是通过以可扩展的方式分配行号来保持大文件中各行之间的顺序.
My goal is to preserve the order between rows within a large file by assigning row numbers in a scalable way.
如何通过 Apache Beam 获取结果?
How can I get the result by Apache Beam?
一些相关帖子:Apache Flink 上的 zipWithIndex
推荐答案
(复制我来自 user@beam.apache.org 的回复)
这很有趣.所以如果我理解你的算法,它会像(伪代码):
This is interesting. So if I understand your algorithm, it would be something like (pseudocode):
A = ReadWithShardedLineNumbers(myFile) : output K<ShardOffset+LocalLineNumber>, V<Data>
B = A.ExtractShardOffsetKeys() : output K<ShardOffset>, V<LocalLineNumber>
C = B.PerKeySum() : output K<ShardOffset>, V<ShardTotalLines>
D = C.GlobalSortAndPrefixSum() : output K<ShardOffset> V<ShardLineNumberOffset>
E = [A,D].JoinAndCalculateGlobalLineNumbers() : output V<GlobalLineNumber+Data>
这有几个假设:
ReadWithShardedLineNumbers
:Sources 可以输出它们的分片偏移量,并且偏移量是全局排序的GlobalSortAndPrefixSum
:所有读取分片的总数可以放入内存中以执行总排序
ReadWithShardedLineNumbers
: Sources can output their shard offset, and the offsets are globally orderedGlobalSortAndPrefixSum
: The totals for all read shards can fit in memory to perform a total sort
假设#2 不适用于所有数据大小,并且因运行程序而异,具体取决于读取分片的粒度.但对于一些实际的文件大小子集来说似乎是可行的.
Assumption #2 will not hold true for all data sizes, and varies by runner depending on how granular the read shards are. But it seems feasible for some practical subset of file-sizes.
另外,我相信上面的伪代码可以在 Beam 中表示,并且不需要 SplittableDoFn.
Also, I believe the pseudo-code above is representable in Beam, and would not require SplittableDoFn.
这篇关于如何在 Apache Beam 中实现像 Spark 一样的 zipWithIndex?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!