"|"是什么意思和“>"在Apache Beam中意味着什么? [英] What do the "|" and ">>" means in Apache Beam?

Views: 103. Published: 2021/4/7 20:58:13. Tags: python-3.x, apache-beam

Problem description

I'm trying to understand Apache Beam. I was following the programming guide, and one example is described like this: "The following code example joins the two PCollections with CoGroupByKey, followed by a ParDo to consume the result. Then, the code uses tags to look up and format data from each collection."

I was quite surprised, because I didn't see a ParDo operation anywhere, so I started wondering whether the | was actually the ParDo. The code looks like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

emails_list = [
    ('amy', 'amy@example.com'),
    ('carl', 'carl@example.com'),
    ('julia', 'julia@example.com'),
    ('carl', 'carl@email.com'),
]
phones_list = [
    ('amy', '111-222-3333'),
    ('james', '222-333-4444'),
    ('amy', '333-444-5555'),
    ('carl', '444-555-6666'),
]

pipeline_options = PipelineOptions()
with beam.Pipeline(options=pipeline_options) as p:
    emails = p | 'CreateEmails' >> beam.Create(emails_list)
    phones = p | 'CreatePhones' >> beam.Create(phones_list)
    results = ({'emails': emails, 'phones': phones} | beam.CoGroupByKey())
    
    def join_info(name_info):
        (name, info) = name_info
        return '%s; %s; %s' % (
            name, sorted(info['emails']), sorted(info['phones']))

    contact_lines = results | beam.Map(join_info)

I do notice that emails and phones are read at the start of the pipeline, so I guess they are two different PCollections, right? But where is the ParDo executed? What do "|" and ">>" actually mean? And how can I see the actual output of this? Does it matter that the join_info function, emails_list and phones_list are defined outside the DAG?

Recommended answer

The | separates the steps of a pipeline: each | applies the transform on its right to the result on its left. For example, using p as the PBegin: p | ReadFromText(..) | ParDo(..) | GroupByKey().

You can also put other PCollections on the left of |:

read = p | ReadFromText(..)
kvs = read | ParDo(..)
gbk = kvs | GroupByKey()

That's equivalent to the previous pipeline: p | ReadFromText(..) | ParDo(..) | GroupByKey()

The >> is used between the | and the PTransform to name the step: p | ReadFromText(..) | "to key value" >> ParDo(..) | GroupByKey()
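
Neither | nor >> is special Beam syntax; both are ordinary Python operators that the SDK overloads. The toy model below is a sketch of that mechanism in plain Python, not Beam's actual classes: Step and Data are hypothetical stand-ins for a PTransform and a PCollection. It also shows why no parentheses are needed around the label: in Python, >> binds more tightly than |, so data | "label" >> step is parsed as data | ("label" >> step).

```python
class Step:
    """Hypothetical stand-in for a PTransform: a per-element function plus a label."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label or fn.__name__

    def __rrshift__(self, label):
        # Invoked for `"some label" >> step`, because str does not define >>.
        return Step(self.fn, label)


class Data:
    """Hypothetical stand-in for a PCollection that remembers the applied labels."""

    def __init__(self, items, applied=()):
        self.items = list(items)
        self.applied = tuple(applied)

    def __or__(self, step):
        # Invoked for `data | step`: apply the step and record its label.
        new_items = [step.fn(item) for item in self.items]
        return Data(new_items, self.applied + (step.label,))


def increment(x):
    return x + 1


if __name__ == '__main__':
    result = Data([1, 2, 3]) | "double" >> Step(lambda x: x * 2) | Step(increment)
    print(result.items)    # values after both steps
    print(result.applied)  # labels in application order
```

The unlabeled step falls back to a default label derived from the function name. In a real pipeline, explicit labels matter when the same kind of transform is applied more than once, which is why the question's code writes 'CreateEmails' >> beam.Create(..) and 'CreatePhones' >> beam.Create(..).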
