Spark: How does salting work in dealing with skewed data?

Question

I have skewed data in one table, which is then joined with another, smaller table. I understand how salting works for joins: a random number from a fixed range is appended to the keys of the big, skewed table, and each row of the small, unskewed table is duplicated once for every number in that range. The match still happens because exactly one of the duplicated rows carries the same salt as any given salted key in the skewed table. I have also read that salting helps when performing a groupBy. My question is: when random numbers are appended to the key, doesn't it break the group? If it does, then the meaning of the group-by operation has changed.

Recommended answer

My question is: when random numbers are appended to the key, doesn't it break the group?

Yes, it does. To mitigate this, you can run the group-by operation twice: first with the salted key, then again after removing the salt. The second grouping operates on partially aggregated data, which significantly reduces the impact of the skew.

For example:

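import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Placeholders: n = number of salt buckets, groupByFields: Seq[String], aggFields: Seq[Column].
// aggFields must be re-applicable to their own output, e.g. sum("x").as("x");
// count or avg need a different expression in the second pass.
df.withColumn("salt", (rand() * n).cast(IntegerType))  // random salt in [0, n)
  .groupBy("salt", groupByFields: _*)                   // first pass: group by (salt, key)
  .agg(aggFields.head, aggFields.tail: _*)
  .groupBy(groupByFields.head, groupByFields.tail: _*)  // second pass: salt dropped
  .agg(aggFields.head, aggFields.tail: _*)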
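
To make the two-pass idea concrete, here is a minimal sketch with a sum and a row count. Everything specific in it is an assumption for illustration and not part of the original answer: the local SparkSession, the toy DataFrame, the column names (key, value, partial_sum, partial_cnt) and the bucket count n = 8. It is written in script style, as one would paste into spark-shell, and compares the salted result with a plain groupBy.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Local session for illustration only; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("salted-groupby-sketch").getOrCreate()
import spark.implicits._

// Hypothetical skewed input: key "a" dominates.
val df = (Seq.fill(1000)(("a", 1L)) ++ Seq(("b", 2L), ("c", 3L))).toDF("key", "value")
val n = 8 // number of salt buckets (assumed; tune to the degree of skew)

// Pass 1: aggregate per (key, salt), so the rows of "a" are spread over up to n groups.
val partial = df
  .withColumn("salt", floor(rand() * n).cast("int"))
  .groupBy("key", "salt")
  .agg(sum("value").as("partial_sum"), count(lit(1)).as("partial_cnt"))

// Pass 2: drop the salt and merge the partials. Partial counts are merged with sum, not count.
val salted = partial
  .groupBy("key")
  .agg(sum("partial_sum").as("total"), sum("partial_cnt").as("cnt"))

// Reference result from a plain, unsalted groupBy.
val direct = df.groupBy("key").agg(sum("value").as("total"), count(lit(1)).as("cnt"))

val saltedRows = salted.orderBy("key").collect().toSeq
val directRows = direct.orderBy("key").collect().toSeq
```

The salt changes the groups only in the intermediate step. Because sum, count, min and max can be rebuilt from partial results, the second pass restores the original meaning of the group-by. An average can be derived afterwards as total / cnt. Aggregates that cannot be merged from partials, such as an exact median, do not fit this pattern directly.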
