输出类型中beam.ParDo和beam.Map之间的区别？ [英] Difference between beam.ParDo and beam.Map in the output type?

查看：337 发布时间：2020/6/6 19:18:14 python-2.7 apache-beam apache-beam-io

本文介绍了输出类型中beam.ParDo和beam.Map之间的区别？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我正在使用Apache-Beam运行一些数据转换，包括从txt，csv和不同数据源提取数据。
我注意到的一件事是使用 beam.Map 和 beam.ParDo

I am using Apache-Beam to run some data transformation, which including data extraction from txt, csv, and different sources of data. One thing I noticed, is the difference of results when using beam.Map and beam.ParDo

在下一个示例中：

我正在读取csv数据，在第一种情况下，使用 beam.ParDo 将其传递给DoFn ，它将提取第一个元素即日期，然后进行打印。
在第二种情况下，我直接使用 beam.Map 做同样的事情，然后打印出来。

I am reading csv data, and in the first case pass it to a DoFn using a beam.ParDo, which extracts the first element which is the date, then print it. In the second case, I directly use beam.Map to do the same thing, then print it.

class Printer(beam.DoFn):
    def process(self,data_item):
        print data_item

class DateExtractor(beam.DoFn):
    def process(self,data_item):
        return (str(data_item).split(','))[0]

data_from_source = (p
                    | 'ReadMyFile 01' >> ReadFromText('./input/data.csv')
                    | 'Splitter using beam.ParDo 01' >> beam.ParDo(DateExtractor())
                    | 'Printer the data 01' >> beam.ParDo(Printer())
                    )

copy_of_the_data =  (p
                    | 'ReadMyFile 02' >> ReadFromText('./input/data.csv')
                    | 'Splitter using beam.Map 02' >> beam.Map(lambda record: (record.split(','))[0])
                    | 'Printer the data 02' >> beam.ParDo(Printer())
                    )

我在两个输出中注意到的是下一个：

What I noticed in the two outputs are the next:

##With beam.ParDo##
2
0
1
7
-
0
4
-
0
3
2
0
1
7

##With beam.Map##
2017-04-03
2017-04-03
2017-04-10
2017-04-10
2017-04-11
2017-04-12
2017-04-12

我觉得这很奇怪。我想知道打印功能是否有问题？但是在使用不同的转换后，它显示出相同的结果。
作为示例运行：

I find this strange. I am wondering if the problem in the printing function? But after using different transformations, it is showing the same results. As Example running:

| 'Group it 01' >> beam.Map(lambda record: (record, 1))

仍然返回相同的问题： / p>

which still returning the same issue :

##With beam.ParDo##
('8', 1)
('2', 1)
('0', 1)
('1', 1)

##With beam.Map##
(u'2017-04-08', 1)
(u'2017-04-08', 1)
(u'2017-04-09', 1)
(u'2017-04-09', 1)

任何人都知道原因是什么？ beam.Map 和 beam.ParDo

Any idea what is the reason? What do I am missing in the difference between beam.Map and beam.ParDo ???

推荐答案

简短答案

您需要包装 ParDo的返回值进入列表。

高级版本

ParDos 通常可以为单个输入返回任意数量的输出，即对于单个输入字符串，您可以发出零个，一个或多个结果。因此，Beam SDK将 ParDo 的输出视为单个元素，而不是元素的集合。

ParDos in general can return any number of outputs for a single input, i.e. for a single input string you can emit zero, one, or many results. For this reason the Beam SDK treats the output of a ParDo as not a single element, but a collection of elements.

在您的情况下， ParDo 发出单个字符串而不是集合。 Beam Python SDK仍尝试将 ParDo 的输出解释为好像是元素的集合。通过将您发出的字符串解释为字符集合来实现。因此，您的 ParDo 现在可以有效地产生单个字符流，而不是字符串流。

In your case the ParDo emits a single string instead of a collection. Beam Python SDK still tries to interpret the output of that ParDo as if it was a collection of elements. And it does so by interpreting the string you emitted as collection of characters. Because of that, your ParDo now effectively produces a stream of single characters, not a stream of strings.

什么您需要做的就是将返回值包装在一个列表中：

What you need to do is wrap your return value into a list:

class DateExtractor(beam.DoFn):
    def process(self,data_item):
        return [(str(data_item).split(','))[0]]

请注意方括号。有关更多示例，请参见编程指南。

Notice the square brackets. See the programming guide for more examples.

地图视为 ParDo 的特例。预计地图会为每个输入恰好产生一个输出。因此，在这种情况下，您只需从lambda中返回一个值即可，它可以按预期工作。

Map, on the other hand, can be thought of as a special case of ParDo. Map is expected to produce exactly one output for each input. So in this case you can just return a single value out of lambda and it works as expected.

您可能不需要包装 str 中的> data_item 。根据文档 ReadFromText 转换产生字符串。

And you probably don't need to wrap the data_item in str. According to the docs the ReadFromText transform produces strings.

这篇关于输出类型中beam.ParDo和beam.Map之间的区别？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

输出类型中beam.ParDo和beam.Map之间的区别？ [英] Difference between beam.ParDo and beam.Map in the output type?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

输出类型中beam.ParDo和beam.Map之间的区别？ [英] Difference between beam.ParDo and beam.Map in the output type?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭