DataFlow(PY 2.x SDk)ReadFromPubSub :: id_label& timestamp_attribute行为异常 [英] DataFlow (PY 2.x SDk) ReadFromPubSub :: id_label & timestamp_attribute behaving unexpectedly

查看:58
本文介绍了DataFlow(PY 2.x SDk)ReadFromPubSub :: id_label& timestamp_attribute行为异常的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的apache梁管道(使用Python SDK + DirecrRunner进行测试…)正在从Pubsub主题中阅读

My apache beam pipeline (using Python SDK+ DirecrRunner for testing purpose…) is reading from Pubsub topic

消息&发布的属性如下:

The message & attributes published are as follows:

message: [{"col1": "test column 1", "col2": "test column 1"}]

attributes:{
  'event_time_v1': str(time.time()),
  'record_id': 'row-1’,
}

我正在使用函数

I’m using the function beam.io.gcp.pubsub.ReadFromPubSub. The code/doc mentions id_label and timestamp_attribute arguments (I believe these are very new additions?! Updated only 13 days ago..)

  1. 当我使用id_label为每个元素分配唯一的ID以进行重复数据删除时,出现以下错误:
  1. When I use id_label in order to assign each element a unique id for dedupe purpose, I get following error:

NotImplementedError:DirectRunner:PubSub读取不支持id_label``

NotImplementedError: DirectRunner: id_label is not supported for PubSub reads```

为什么这样?我的理解是正确的,我仍然缺少某些代码实现,还是我在这里缺少了什么?

  1. 当我使用timestamp_attribute = 'event_time_v1’时,为了给每个元素分配自己的时间戳(消息属性event_time_v1中传递的客户端事件时间),我注意到实际上分配给该元素的时间戳仍然是消息发布时间
  1. When I use timestamp_attribute = 'event_time_v1’, in order to assign my own timestamp to each element (client side event time passed in message attribute event_time_v1), I notice timestamp actually assigned to the element is still the message publish time

为什么这样?我希望这是event_time_v1

why so? I expected it would be the time passed in event_time_v1

我正在使用以下DoFn来打印元素的时间戳记

I'm using following DoFn to print element's timestamp

class PrintFn(beam.DoFn):

      print(element, timestamp)
      return [element]

非常感谢您对此的任何解释

Thanks a lot in advance for any explanation to that

推荐答案

我今天也遇到了同样的问题,实际上在Jira上有一个开放问题,因为直接运行器中无法使用id_label和timestamp_attribute(假设从阅读中了解所有非数据流运行程序).在将DataflowRunner指定为运行程序时,我已经能够成功使用id_label(还有其他问题,但这是by所致).

I have had the same problem with this today, there is actually an open issue on Jira for id_label and timestamp_attribute being unavailable in the direct runner (and I'm assuming from reading, any non dataflow runners). I've successfully been able to use id_label when specifying DataflowRunner as the runner (with some other issues, but that's by the by).

Jira问题如下:

https: //issues.apache.org/jira/browse/BEAM-4275?jql=text%20~%20%22python%20id_label%22

因此,目前看来,使用直接流道还不可能做到这一点.

So, at the moment, it would appear this is not yet possible to do using the direct runner.

这篇关于DataFlow(PY 2.x SDk)ReadFromPubSub :: id_label& timestamp_attribute行为异常的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆