"|"是什么意思和“>"在Apache Beam中意味着什么? [英] What do the "|" and ">>" means in Apache Beam?
Question
I'm trying to understand Apache Beam. I was following the programming guide, and one example is described this way: "The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection."
I was quite surprised, because I didn't see a ParDo operation at any point, so I started to wonder whether the | was actually the ParDo. The code looks like this:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

emails_list = [
    ('amy', 'amy@example.com'),
    ('carl', 'carl@example.com'),
    ('julia', 'julia@example.com'),
    ('carl', 'carl@email.com'),
]
phones_list = [
    ('amy', '111-222-3333'),
    ('james', '222-333-4444'),
    ('amy', '333-444-5555'),
    ('carl', '444-555-6666'),
]

pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
    emails = p | 'CreateEmails' >> beam.Create(emails_list)
    phones = p | 'CreatePhones' >> beam.Create(phones_list)
    results = ({'emails': emails, 'phones': phones} | beam.CoGroupByKey())

    def join_info(name_info):
        (name, info) = name_info
        return '%s; %s; %s' %\
            (name, sorted(info['emails']), sorted(info['phones']))

    contact_lines = results | beam.Map(join_info)
I do notice that emails and phones are read at the start of the pipeline, so I guess they are two different PCollections, right? But where is the ParDo executed? What do "|" and ">>" actually mean? How can I see the actual output of this? And does it matter if the join_info function, emails_list and phones_list are defined outside the DAG?
Recommended answer
The | separates the steps of a pipeline, for example (using p as the PBegin): p | ReadFromText(..) | ParDo(..) | GroupByKey().
You can also put a reference to another PCollection before the |:
read = p | ReadFromText(..)
kvs = read | ParDo(..)
gbk = kvs | GroupByKey()
This is equivalent to the previous pipeline: p | ReadFromText(..) | ParDo(..) | GroupByKey()
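The equivalence of the chained and the step-by-step form follows from how Python evaluates |: left to right, one operator method call at a time. The sketch below is a toy model in plain Python, not Apache Beam's real classes. ToyTransform, to_key_value and group_by_key are hypothetical names invented for illustration, and the toy runs each step immediately, whereas a Beam pipeline only builds a graph that is executed later.

```python
# Toy model of the pipe syntax; NOT Apache Beam's real classes.
class ToyTransform:
    """A step that turns one collection into another (hypothetical)."""

    def __init__(self, fn):
        self.fn = fn

    def __ror__(self, left):
        # Python calls this for `left | self` when `left` (here a list)
        # does not implement `|` itself.
        return self.fn(left)


def to_key_value(words):
    # Pair every word with the count 1.
    return [(word, 1) for word in words]


def group_by_key(pairs):
    # Collect all values that share a key.
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


words = ['a', 'b', 'a']

# One chained expression ...
chained = words | ToyTransform(to_key_value) | ToyTransform(group_by_key)

# ... and the same steps with named intermediate results.
kvs = words | ToyTransform(to_key_value)
gbk = kvs | ToyTransform(group_by_key)
```

Both forms produce the same grouped result, because the chained expression is simply the two-step version without the intermediate names.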
The >> goes between the | and the PTransform to name the step: p | ReadFromText(..) | "to key value" >> ParDo(..) | GroupByKey()
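The label syntax works because >> binds more tightly than | in Python, so the string is attached to the transform before the pipe is applied. The following is a minimal sketch of that mechanism in plain Python, not Apache Beam's implementation. ToyTransform, applied_steps and double are hypothetical names, and falling back to the function name for unlabeled steps is an assumption made only for this toy.

```python
# Toy model of step naming; NOT Apache Beam's real classes.
applied_steps = []  # records the name of every step that runs


class ToyTransform:
    def __init__(self, fn, label=None):
        self.fn = fn
        # Assumption for this toy: unnamed steps use the function name.
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Python calls this for `'some label' >> self`, because str
        # does not implement `>>` itself.
        return ToyTransform(self.fn, label)

    def __ror__(self, left):
        # Python calls this for `left | self`.
        applied_steps.append(self.label)
        return self.fn(left)


def double(values):
    return [value * 2 for value in values]


# `>>` has higher precedence than `|`, so this parses as
# [1, 2, 3] | ('double everything' >> ToyTransform(double)).
named = [1, 2, 3] | 'double everything' >> ToyTransform(double)
unnamed = [1, 2, 3] | ToyTransform(double)
```

After both lines run, applied_steps holds 'double everything' followed by 'double', which shows that the label only changes the step's name and not what the step computes.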