Why is my BroadcastHashJoin slower than ShuffledHashJoin in Spark?

Views: 20 | Published: 2021/12/15 18:35:51 | Tags: hadoop, apache-spark, hive

Problem description

I execute a join in Spark using a JavaHiveContext.

The big table is 1.76 GB and has 100 million records.

The second table is 273 MB and has 10 million records.

I get a JavaSchemaRDD and call count() on it:

String query = "select attribute7,count(*) from ft,dt where ft.chiavedt=dt.chiavedt group by attribute7";
JavaSchemaRDD rdd = sqlContext.sql(query);
System.out.println("count=" + rdd.count());

If I force a BroadcastHashJoin (SET spark.sql.autoBroadcastJoinThreshold=290000000) and use 5 executors on 5 nodes, each with 8 cores and 20 GB of memory, the query runs in 100 seconds. If I don't force the broadcast, it runs in 30 seconds.
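To make the threshold arithmetic explicit, here is a minimal, self-contained sketch of why 290000000 is enough to make the smaller table eligible for broadcast. It assumes that "273Mb" means 273 MiB and that Spark's size estimate for the table is close to that figure; both are assumptions, since Spark compares the threshold against its own estimate rather than the size on disk. The class and method names are hypothetical and are not part of Spark.

```java
public class BroadcastThresholdCheck {
    // Default value of spark.sql.autoBroadcastJoinThreshold: 10 MB, expressed in bytes.
    public static final long DEFAULT_THRESHOLD_BYTES = 10L * 1024 * 1024;

    // Threshold used in the question to force the broadcast join.
    public static final long FORCED_THRESHOLD_BYTES = 290_000_000L;

    // Size of the smaller table, ASSUMING "273Mb" means 273 MiB.
    public static final long SMALL_TABLE_BYTES = 273L * 1024 * 1024;

    // Hypothetical helper: a table is a broadcast candidate when its
    // estimated size in bytes does not exceed the configured threshold.
    public static boolean qualifiesForBroadcast(long estimatedBytes, long thresholdBytes) {
        return thresholdBytes > 0 && estimatedBytes <= thresholdBytes;
    }

    public static void main(String[] args) {
        System.out.println("small table bytes      = " + SMALL_TABLE_BYTES);
        System.out.println("with default threshold = "
                + qualifiesForBroadcast(SMALL_TABLE_BYTES, DEFAULT_THRESHOLD_BYTES));
        System.out.println("with forced threshold  = "
                + qualifiesForBroadcast(SMALL_TABLE_BYTES, FORCED_THRESHOLD_BYTES));
    }
}
```

Under these assumptions the table is 286,261,248 bytes, which is below 290,000,000 but far above the 10 MB default. That is consistent with the broadcast join being chosen only once the threshold is raised.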

N.B. The tables are stored as Parquet files.
