输出类型中beam.ParDo 和beam.Map 的区别? [英] Difference between beam.ParDo and beam.Map in the output type?
问题描述
我正在使用 Apache-Beam 运行一些数据转换,包括从 txt、csv 和不同数据源中提取数据.我注意到的一件事是使用 beam.Map 和 beam.ParDo
I am using Apache-Beam to run some data transformation, which including data extraction from txt, csv, and different sources of data. One thing I noticed, is the difference of results when using beam.Map and beam.ParDo
在下一个示例中:
我正在读取 csv 数据,在第一种情况下,使用 beam.ParDo 将它传递给 DoFn,它提取第一个元素,即日期,然后打印它.第二种情况,我直接用beam.Map做同样的事情,然后打印出来.
I am reading csv data, and in the first case pass it to a DoFn using a beam.ParDo, which extracts the first element which is the date, then print it. In the second case, I directly use beam.Map to do the same thing, then print it.
class Printer(beam.DoFn):
def process(self,data_item):
print data_item
class DateExtractor(beam.DoFn):
def process(self,data_item):
return (str(data_item).split(','))[0]
data_from_source = (p
| 'ReadMyFile 01' >> ReadFromText('./input/data.csv')
| 'Splitter using beam.ParDo 01' >> beam.ParDo(DateExtractor())
| 'Printer the data 01' >> beam.ParDo(Printer())
)
copy_of_the_data = (p
| 'ReadMyFile 02' >> ReadFromText('./input/data.csv')
| 'Splitter using beam.Map 02' >> beam.Map(lambda record: (record.split(','))[0])
| 'Printer the data 02' >> beam.ParDo(Printer())
)
我在两个输出中注意到的是下一个:
What I noticed in the two outputs are the next:
##With beam.ParDo##
2
0
1
7
-
0
4
-
0
3
2
0
1
7
##With beam.Map##
2017-04-03
2017-04-03
2017-04-10
2017-04-10
2017-04-11
2017-04-12
2017-04-12
我觉得这很奇怪.我想知道打印功能是否有问题?但是在使用不同的转换后,它显示出相同的结果.例如运行:
I find this strange. I am wondering if the problem in the printing function? But after using different transformations, it is showing the same results. As Example running:
| 'Group it 01' >> beam.Map(lambda record: (record, 1))
仍然返回相同的问题:
##With beam.ParDo##
('8', 1)
('2', 1)
('0', 1)
('1', 1)
##With beam.Map##
(u'2017-04-08', 1)
(u'2017-04-08', 1)
(u'2017-04-09', 1)
(u'2017-04-09', 1)
知道是什么原因吗?我在 beam.Map 和 beam.ParDo 之间的区别中缺少什么???
Any idea what is the reason? What do I am missing in the difference between beam.Map and beam.ParDo ???
推荐答案
简短回答
您需要将 ParDo
的返回值包装到一个列表中.
You need to wrap the return value of a ParDo
into a list.
更长的版本
ParDos
通常可以为单个输入返回任意数量的输出,即对于单个输入字符串,您可以发出零个、一个或多个结果.出于这个原因,Beam SDK 将 ParDo
的输出视为不是单个元素,而是元素的集合.
ParDos
in general can return any number of outputs for a single input, i.e. for a single input string you can emit zero, one, or many results. For this reason the Beam SDK treats the output of a ParDo
as not a single element, but a collection of elements.
在您的情况下,ParDo
发出单个字符串而不是集合.Beam Python SDK 仍然尝试将 ParDo
的输出解释为元素的集合.它通过将您发出的字符串解释为字符集合来实现.因此,您的 ParDo
现在可以有效地生成单个字符流,而不是字符串流.
In your case the ParDo
emits a single string instead of a collection. Beam Python SDK still tries to interpret the output of that ParDo
as if it was a collection of elements. And it does so by interpreting the string you emitted as collection of characters. Because of that, your ParDo
now effectively produces a stream of single characters, not a stream of strings.
您需要做的是将您的返回值包装到一个列表中:
What you need to do is wrap your return value into a list:
class DateExtractor(beam.DoFn):
def process(self,data_item):
return [(str(data_item).split(','))[0]]
注意方括号.有关更多示例,请参阅编程指南.
Notice the square brackets. See the programming guide for more examples.
Map
可以被认为是 ParDo
的一个特例.Map
预计为每个输入生成一个输出.因此,在这种情况下,您只需从 lambda 返回一个值,它就会按预期工作.
Map
, on the other hand, can be thought of as a special case of ParDo
. Map
is expected to produce exactly one output for each input. So in this case you can just return a single value out of lambda and it works as expected.
而且您可能不需要将 data_item
包装在 str
中.根据文档ReadFromText
转换产生字符串.
And you probably don't need to wrap the data_item
in str
. According to the docs the ReadFromText
transform produces strings.
这篇关于输出类型中beam.ParDo 和beam.Map 的区别?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!