Skewed By in Spark


Problem description

I have a dataset that I want to partition by a particular key (clientID), but some clients produce far, far more data than others. There's a feature in Hive called "ListBucketing", invoked by "skewed by", specifically to deal with this situation.

However, I cannot find any indication that Spark supports this feature, or how (if it does support it) to make use of it.

Is there an equivalent Spark feature? Or does Spark have some other set of features with which this behavior can be replicated?

(As a bonus - and a requirement for my actual use case - does your suggested method work with Amazon Athena?)

Recommended answer

As far as I know, there is no such out-of-the-box tool in Spark. With skewed data, a very common approach is to add an artificial column to further bucketize the data.

Let's say you want to partition by column "y", but the data is very skewed, as in this toy example (one partition with 5 rows, the others with only one row each):

import org.apache.spark.sql.functions._  // when, floor, rand
import spark.implicits._                  // enables the 'id symbol-to-column syntax

val df = spark.range(8).withColumn("y", when('id < 5, 0).otherwise('id))
df.show()
+---+---+
| id|  y|
+---+---+
|  0|  0|
|  1|  0|
|  2|  0|
|  3|  0|
|  4|  0|
|  5|  5|
|  6|  6|
|  7|  7|
+---+---+

Now let's add an artificial random column and write the DataFrame.

// Split the skewed groups into at most 3 random sub-buckets
val maxNbOfBuckets = 3
val part_df = df.withColumn("r", floor(rand() * maxNbOfBuckets))
part_df.show
+---+---+---+
| id|  y|  r|
+---+---+---+
|  0|  0|  2|
|  1|  0|  2|
|  2|  0|  0|
|  3|  0|  0|
|  4|  0|  1|
|  5|  5|  2|
|  6|  6|  2|
|  7|  7|  1|
+---+---+---+
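
As a quick sanity check (a small addition, not part of the original answer), you can count the rows per (y, r) pair to confirm that the skewed y = 0 group is now spread over several r buckets:

// Sanity check (not in the original answer): rows per (y, r) combination
part_df.groupBy("y", "r").count().orderBy("y", "r").show()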

// Write, partitioning by both columns: the skewed partition (5 rows for y=0) is now split into 3 sub-partitions.
part_df.write.partitionBy("y", "r").csv("...")
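
On the reading side, here is a minimal sketch of how the output could be consumed with Spark (assuming the same elided output path): partition discovery recovers y and r from the directory layout, so queries still filter on y only, and the artificial r column can simply be dropped or ignored.

// Sketch only: "..." stands for the same elided output path as above
val readBack = spark.read.csv("...")
readBack.where('y === 0).drop("r").show()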
