Apache Beam 将数据流 pub/sub 解析为字典 [英] Apache beam parsing data flow pub/sub into a dictionary

查看:35
本文介绍了Apache Beam 将数据流 pub/sub 解析为字典的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用光束/数据流运行流式管道.我正在从 pub/sub 读取我的输入,并将其转换为如下 dict:

I am running a streaming pipeline using beam / dataflow. I am reading my input from pub/sub as converting into a dict as below:

    raw_loads_dict = (p 
      | 'ReadPubsubLoads' >> ReadFromPubSub(topic=PUBSUB_TOPIC_NAME).with_output_types(bytes)
      | 'JSONParse' >> beam.Map(lambda x: json.loads(x)) 
    )

由于这是在高吞吐量管道的每个元素上完成的,我担心这不是最有效的方法吗?

Since this is done on each element of a high throughput pipeline I am worried that this is not the most efficient way to do this?

在这种情况下,考虑到我在某些情况下操作数据,但可能直接将其流式传输到 bigquery 中,最佳做法是什么.

What is the best practice in this case, considering I am then manipulating the data in some cases, but could potentially stream it directly into bigquery.

推荐答案

这种方法很好,除非发生了效率极低的事情或者您有一些特定的担忧(例如,您观察到的某些指标似乎不正确).JSON 解析似乎足够轻量级,这不是问题.Beam pipeline runners 甚至可以潜在地融合多个类似的操作,以便它们在同一台机器上执行以提高效率,从而避免在工作机器之间传输数据.

This approach is fine unless there's something extremely inefficient happening or you have some specific concern (e.g. some metric you observe doesn't seem right). JSON parsing seems lightweight enough for this not to be a problem. Beam pipeline runners can even potentially fuse multiple operations like that so that they are executed on the same machine for efficiency to avoid transferring data between worker machines.

您开始发现性能问题的主要情况可能涉及外部系统(例如调用外部服务时的网络延迟或节流),或需要聚合数据的分组操作(例如使用 GroupByKey/CoGroupByKey 实现连接)在某个地方的持久存储中,需要在工作机器之间传输(随机操作).在这些情况下,与网络、持久性和其他相关成本相比,JSON 解析或运行每个元素的一些相对简单的转换代码的成本可能可以忽略不计.

A major situation where you can start seeing performance issues would probably involve either external systems (e.g. network latencies or throttling when calling external services), or grouping operations (e.g. implementing joins using GroupByKey/CoGroupByKey) where the data needs to be aggregated in a persistent store somewhere and needs to be transferred between worker machines (shuffle operation). In these situations though the costs of JSON parsing or running some relatively simple transformation code per-element would likely be negligible compared to network, persistence and other related costs.

这篇关于Apache Beam 将数据流 pub/sub 解析为字典的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆