Which one will perform better, broadcast variable or broadcast join?

Question

I am using Spark 2.4.1 with Java 8 in my project.

I have a scenario where I need to look up another table/dataset that has two fields: country-name and country-code.

A separate stream of data has a country-code column, and I need to map each code to its country-name in the target/result dataframe.

As far as I know, there are two ways to achieve this: a join, or a broadcast variable used as a lookup.

So from a performance point of view, which one is better here? What is the standard Spark approach for this kind of use case?

Recommended answer

Quite honestly, they should perform similarly, since they are effectively doing the same thing.

Letting Spark perform the broadcast join natively may have a very slight advantage, but that likely depends on the size of your fact table and on how much overhead the broadcast variable adds.
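For comparison, the broadcast-variable route might look roughly like the sketch below. It is not taken from the original answer: the class name `BroadcastVariableLookup`, the UDF name `countryName`, and the column names `country_code` and `country_name` are all assumptions made for illustration. The small table is collected to the driver, shipped to the executors once as a broadcast variable, and read inside a UDF.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class BroadcastVariableLookup {

    // Hypothetical helper: adds a country_name column by looking up country_code
    // in a map that is sent to the executors as a broadcast variable.
    public static Dataset<Row> addCountryName(SparkSession spark,
                                              Dataset<Row> events,
                                              Dataset<Row> countries) {
        // Collect the small dimension table to the driver as a plain map.
        Map<String, String> codeToName = new HashMap<>();
        for (Row r : countries.select("country_code", "country_name").collectAsList()) {
            codeToName.put(r.getString(0), r.getString(1));
        }

        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        Broadcast<Map<String, String>> lookup = jsc.broadcast(codeToName);

        // The UDF reads the broadcast value on the executor; unknown codes yield null.
        spark.udf().register("countryName",
                (UDF1<String, String>) code -> lookup.value().get(code),
                DataTypes.StringType);

        return events.withColumn("country_name",
                callUDF("countryName", col("country_code")));
    }
}
```

Note that the UDF is opaque to Spark's optimizer, which is one plausible source of the small overhead mentioned above.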

One thing to note: the default broadcast threshold is only 10 MiB, so if your dimension table is larger than that, you'll want to use the broadcast() hint explicitly.
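A minimal sketch of the hinted join in Java follows. The class name `BroadcastJoinLookup` and the column names `country_code` and `country_name` are assumptions, not names from the question. A left join is used here so that rows with an unknown code are kept with a null name; an inner join would drop them.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.broadcast;

public class BroadcastJoinLookup {

    // Hypothetical helper: left-joins events to the small countries table and
    // hints Spark to broadcast that table regardless of the size threshold.
    public static Dataset<Row> addCountryName(Dataset<Row> events, Dataset<Row> countries) {
        Dataset<Row> dim = countries.select("country_code", "country_name");
        return events.join(
                    broadcast(dim),
                    events.col("country_code").equalTo(dim.col("country_code")),
                    "left")
                // Drop the duplicate key column that came from the dimension side.
                .drop(dim.col("country_code"));
    }
}
```

The alternative to the hint is raising `spark.sql.autoBroadcastJoinThreshold` (a size in bytes; -1 disables automatic broadcasting), which affects every join in the session rather than this one only.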
