DataFlow(PY 2.x SDk)ReadFromPubSub :: id_label& timestamp_attribute行为异常 [英] DataFlow (PY 2.x SDk) ReadFromPubSub :: id_label & timestamp_attribute behaving unexpectedly
问题描述
我的apache梁管道(使用Python SDK + DirecrRunner进行测试…)正在从Pubsub主题中阅读
My apache beam pipeline (using Python SDK+ DirecrRunner for testing purpose…) is reading from Pubsub topic
消息&发布的属性如下:
The message & attributes published are as follows:
message: [{"col1": "test column 1", "col2": "test column 1"}]
attributes:{
'event_time_v1': str(time.time()),
'record_id': 'row-1’,
}
I’m using the function beam.io.gcp.pubsub.ReadFromPubSub. The code/doc mentions id_label
and timestamp_attribute
arguments (I believe these are very new additions?! Updated only 13 days ago..)
- 当我使用
id_label
为每个元素分配唯一的ID以进行重复数据删除时,出现以下错误:
- When I use
id_label
in order to assign each element a unique id for dedupe purpose, I get following error:
NotImplementedError:DirectRunner:PubSub读取不支持id_label``
NotImplementedError: DirectRunner: id_label is not supported for PubSub reads```
为什么这样?我的理解是正确的,我仍然缺少某些代码实现,还是我在这里缺少了什么?
- 当我使用
timestamp_attribute = 'event_time_v1’
时,为了给每个元素分配自己的时间戳(消息属性event_time_v1
中传递的客户端事件时间),我注意到实际上分配给该元素的时间戳仍然是消息发布时间
- When I use
timestamp_attribute = 'event_time_v1’
, in order to assign my own timestamp to each element (client side event time passed in message attributeevent_time_v1
), I notice timestamp actually assigned to the element is still the message publish time
为什么这样?我希望这是event_time_v1
why so? I expected it would be the time passed in event_time_v1
我正在使用以下DoFn来打印元素的时间戳记
I'm using following DoFn to print element's timestamp
class PrintFn(beam.DoFn):
print(element, timestamp)
return [element]
非常感谢您对此的任何解释
Thanks a lot in advance for any explanation to that
推荐答案
我今天也遇到了同样的问题,实际上在Jira上有一个开放问题,因为直接运行器中无法使用id_label和timestamp_attribute(假设从阅读中了解所有非数据流运行程序).在将DataflowRunner指定为运行程序时,我已经能够成功使用id_label(还有其他问题,但这是by所致).
I have had the same problem with this today, there is actually an open issue on Jira for id_label and timestamp_attribute being unavailable in the direct runner (and I'm assuming from reading, any non dataflow runners). I've successfully been able to use id_label when specifying DataflowRunner as the runner (with some other issues, but that's by the by).
Jira问题如下:
https: //issues.apache.org/jira/browse/BEAM-4275?jql=text%20~%20%22python%20id_label%22
因此,目前看来,使用直接流道还不可能做到这一点.
So, at the moment, it would appear this is not yet possible to do using the direct runner.
这篇关于DataFlow(PY 2.x SDk)ReadFromPubSub :: id_label& timestamp_attribute行为异常的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!