Hive partitions, Spark partitions and joins in Spark - how they relate

Question

Trying to understand how Hive partitions relate to Spark partitions, culminating in a question about joins.

I have two external Hive tables, both backed by S3 buckets and partitioned by date, so in each bucket there are keys with the name format date=<yyyy-MM-dd>/<filename>.

Question 1:

If I read this data into Spark:

val table1 = spark.table("table1").as[Table1Row]
val table2 = spark.table("table2").as[Table2Row]

then how many partitions will the resulting datasets have, respectively? Will the number of partitions equal the number of objects in S3?

Question 2:

Suppose the two row types have the following schema:

Table1Row(date: Date, id: String, ...)
Table2Row(date: Date, id: String, ...)

and that I want to join table1 and table2 on the fields date and id:

table1.joinWith(table2,
  table1("date") === table2("date") && 
    table1("id") === table2("id")
)

Is Spark going to be able to utilize the fact that one of the fields being joined on is the partition key in the Hive tables to optimize the join? And if so how?

Question 3:

Suppose now that I am using RDDs instead:

val rdd1 = table1.rdd
val rdd2 = table2.rdd

AFAIK, the syntax for the join using the RDD API would look something like:

rdd1.map(row1 => ((row1.date, row1.id), row1))
  .join(rdd2.map(row2 => ((row2.date, row2.id), row2)))

Again, is Spark going to be able to utilize the fact that the partition key in the Hive tables is being used in the join?

Answer

then how many partitions will the resulting datasets have, respectively? Will the number of partitions equal the number of objects in S3?

Impossible to answer given the information you've provided. In recent versions, the number of partitions depends primarily on spark.sql.files.maxPartitionBytes, although other factors can play a role as well.
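
For reference, a minimal sketch of how you might check this yourself (not from the original answer; the table name table1 comes from the question, the 64 MB value is purely illustrative):

import org.apache.spark.sql.SparkSession

// Sketch only: lower the maximum bytes packed into one input partition
// (the default is 128 MB); smaller values generally yield more partitions.
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)

// Number of partitions Spark actually planned for the scan.
println(spark.table("table1").rdd.getNumPartitions)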

Is Spark going to be able to utilize the fact that one of the fields being joined on is the partition key in the Hive tables to optimize the join?

Not as of today (Spark 2.3.0); however, Spark can utilize bucketing (DISTRIBUTE BY) to optimize joins. See How to define partitioning of DataFrame?. This might change in the future, once the Data Source API v2 stabilizes.
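
For illustration, a rough sketch of the bucketing route (not part of the original answer; the bucketed table names and the bucket count of 50 are made up):

// Sketch only: rewrite both tables bucketed and sorted by the join keys.
table1.write
  .bucketBy(50, "date", "id")
  .sortBy("date", "id")
  .saveAsTable("table1_bucketed")

table2.write
  .bucketBy(50, "date", "id")
  .sortBy("date", "id")
  .saveAsTable("table2_bucketed")

// With matching bucket columns and bucket counts on both sides, the physical
// plan for this join should show a sort-merge join without an Exchange (shuffle).
spark.table("table1_bucketed")
  .join(spark.table("table2_bucketed"), Seq("date", "id"))
  .explain()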

Suppose now that I am using RDDs instead (...) Again, is Spark going to be able to utilise the fact that the partition key in the Hive tables is being used in the join?

Not at all. Even if the data is bucketed, RDD transformations and functional Dataset transformations are black boxes. No optimizations can be applied here.
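
To make that concrete, a small sketch (again not from the original answer; the partition count of 200 is arbitrary). The keyed RDDs carry no Partitioner derived from the Hive layout, so the join shuffles both sides unless you co-partition them explicitly yourself:

import org.apache.spark.HashPartitioner

// Sketch only: nothing about the Hive partitioning survives the drop to RDDs.
val keyed1 = rdd1.map(row1 => ((row1.date, row1.id), row1))
val keyed2 = rdd2.map(row2 => ((row2.date, row2.id), row2))
println(keyed1.partitioner) // None

// Co-location has to be arranged by hand; this is an explicit repartition,
// not an optimization Spark derives from the Hive partition key.
val part = new HashPartitioner(200)
val joined = keyed1.partitionBy(part).join(keyed2.partitionBy(part))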
