Hive partitions, Spark partitions and joins in Spark - how they relate

Question

Trying to understand how Hive partitions relate to Spark partitions, culminating in a question about joins.

I have two external Hive tables, both backed by S3 buckets and partitioned by date, so in each bucket there are keys with the name format date=<yyyy-MM-dd>/<filename>.

Question 1:

If I read this data into Spark:

val table1 = spark.table("table1").as[Table1Row]
val table2 = spark.table("table2").as[Table2Row]

then how many partitions will the resulting datasets have, respectively? Will the number of partitions equal the number of objects in S3?

Question 2:

Suppose the two row types have the following schema:

Table1Row(date: Date, id: String, ...)
Table2Row(date: Date, id: String, ...)

and that I want to join table1 and table2 on the fields date and id:

table1.joinWith(table2,
  table1("date") === table2("date") && 
    table1("id") === table2("id")
)

Is Spark going to be able to utilize the fact that one of the fields being joined on is the partition key in the Hive tables to optimize the join? And if so how?

Question 3:

Suppose now that I am using RDDs instead:

val rdd1 = table1.rdd
val rdd2 = table2.rdd

AFAIK, the syntax for the join using the RDD API would look something like:

rdd1.map(row1 => ((row1.date, row1.id), row1))
  .join(rdd2.map(row2 => ((row2.date, row2.id), row2)))

Again, is Spark going to be able to utilize the fact that the partition key in the Hive tables is being used in the join?

Answer

then how many partitions will the resulting datasets have, respectively? Will the number of partitions equal the number of objects in S3?

Impossible to answer given the information you've provided. In recent versions, the number of partitions depends primarily on spark.sql.files.maxPartitionBytes, although other factors can play a role as well.
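
For reference, a minimal sketch of how you might check this yourself (not from the original answer; the table name table1 comes from the question, the 64 MB value is purely illustrative):

import org.apache.spark.sql.SparkSession

// Sketch only: lower the maximum bytes packed into one input partition
// (the default is 128 MB); smaller values generally yield more partitions.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)

// Number of partitions Spark actually planned for the scan.
println(spark.table("table1").rdd.getNumPartitions)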

Is Spark going to be able to utilize the fact that one of the fields being joined on is the partition key in the Hive tables to optimize the join?

Not as of today (Spark 2.3.0); however, Spark can utilize bucketing (DISTRIBUTE BY) to optimize joins. See How to define partitioning of DataFrame?. This might change in the future, once the Data Source API v2 stabilizes.
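
For illustration, a rough sketch of the bucketing route (not part of the original answer; the bucketed table names and the bucket count of 50 are made up):

// Sketch only: rewrite both tables bucketed and sorted by the join keys.
table1.write
  .bucketBy(50, "date", "id")
  .sortBy("date", "id")
  .saveAsTable("table1_bucketed")

table2.write
  .bucketBy(50, "date", "id")
  .sortBy("date", "id")
  .saveAsTable("table2_bucketed")

// With matching bucket columns and bucket counts on both sides, the physical
// plan for this join should show a sort-merge join without an Exchange (shuffle).
spark.table("table1_bucketed")
  .join(spark.table("table2_bucketed"), Seq("date", "id"))
  .explain()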

Suppose now that I am using RDDs instead (...) Again, is Spark going to be able to utilise the fact that the partition key in the Hive tables is being used in the join?

Not at all. Even if the data is bucketed, RDD transformations and functional Dataset transformations are black boxes. No optimizations can be applied here.
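
To make that concrete, a small sketch (again not from the original answer; the partition count of 200 is arbitrary). The keyed RDDs carry no Partitioner derived from the Hive layout, so the join shuffles both sides unless you co-partition them explicitly yourself:

import org.apache.spark.HashPartitioner

// Sketch only: nothing about the Hive partitioning survives the drop to RDDs.
val keyed1 = rdd1.map(row1 => ((row1.date, row1.id), row1))
val keyed2 = rdd2.map(row2 => ((row2.date, row2.id), row2))
println(keyed1.partitioner) // None

// Co-location has to be arranged by hand; this is an explicit repartition,
// not an optimization Spark derives from the Hive partition key.
val part = new HashPartitioner(200)
val joined = keyed1.partitionBy(part).join(keyed2.partitionBy(part))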
