Apache Beam parsing Dataflow Pub/Sub into a dictionary

Problem description

I am running a streaming pipeline using Beam/Dataflow. I am reading my input from Pub/Sub and converting it into a dict as below:

    import json
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubSub

    raw_loads_dict = (p
        | 'ReadPubsubLoads' >> ReadFromPubSub(topic=PUBSUB_TOPIC_NAME).with_output_types(bytes)
        | 'JSONParse' >> beam.Map(lambda x: json.loads(x))  # decode each message's bytes into a dict
    )

Since this is done on each element of a high-throughput pipeline, I am worried that this is not the most efficient approach.

What is the best practice in this case, considering that in some cases I then manipulate the data, but could potentially stream it directly into BigQuery?
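Concretely, streaming the parsed dicts straight into BigQuery would build on the pipeline above and could look roughly like the sketch below; the table reference and schema are placeholder assumptions, and each dict's keys are assumed to match the BigQuery column names:

    # Hypothetical sink: append each parsed dict as a row (streaming inserts).
    raw_loads_dict | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
        table='my-project:my_dataset.my_table',  # placeholder table reference
        schema='user_id:STRING,payload:STRING',  # placeholder schema
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)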

Recommended answer

This approach is fine unless there's something extremely inefficient happening or you have some specific concern (e.g. a metric you observe doesn't seem right). JSON parsing is lightweight enough for this not to be a problem. Beam pipeline runners can even potentially fuse multiple operations like this so that they are executed on the same machine, avoiding the cost of transferring data between worker machines.

A situation where you could start seeing performance issues would more likely involve either external systems (e.g. network latencies or throttling when calling external services) or grouping operations (e.g. implementing joins using GroupByKey/CoGroupByKey), where the data needs to be aggregated in a persistent store somewhere and transferred between worker machines (a shuffle operation). Even in those situations, though, the cost of JSON parsing or running relatively simple per-element transformation code would likely be negligible compared to the network, persistence, and other related costs.
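To make the contrast concrete, a grouping step that does introduce a shuffle might look like the minimal sketch below (the key field 'user_id' is a hypothetical placeholder). Unlike the per-element JSON parsing, this forces records with the same key to be brought together, which is where the network and persistence costs appear:

    # Hypothetical join-style grouping: GroupByKey triggers a shuffle so that
    # all records sharing a key end up on the same worker.
    grouped = (raw_loads_dict
        | 'KeyByUser' >> beam.Map(lambda d: (d['user_id'], d))
        | 'GroupPerUser' >> beam.GroupByKey())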
