Why does pre-partitioning benefit a Spark job by reducing shuffling?


Question


Many tutorials mention that pre-partitioning an RDD optimizes data shuffling in Spark jobs. What confuses me is that, as I understand it, pre-partitioning also causes a shuffle, so why would shuffling in advance benefit some operations? Especially since Spark itself optimizes a set of transformations.

For example:

If I want to join two datasets, country (id, country) and income (id, (income, month, year)), what is the difference between these two kinds of operation? (I use PySpark.)

  1. Pre-partition by id:

    country = country.partitionBy(10).persist()
    income = income.partitionBy(10).persist()
    income.join(country)
    

  2. Join directly without pre-partitioning:

    income.join(country)
    

If I only need to calculate this join once, is it still useful to pre-partition before the join? I think partitionBy also requires shuffling, right? And if all my further computation after the join is keyed on country (the id key used for the join will be useless and dropped from the RDD), what should I do to optimize the calculation?

Answer

"If I only need to calculate this join once, is it still useful to pre-partition before the join? I think partitionBy also requires shuffling, right?"

You're perfectly right. Pre-partitioning makes sense only if the partitioned data will be reused along multiple DAG paths. If you join only once, the shuffle simply happens in a different place.
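To make the trade-off concrete, here is a minimal toy cost model in plain Python. It is not Spark code and measures nothing on a real cluster. It assumes that joining an RDD with no partitioner shuffles all of its records, while an RDD that was hash-partitioned and persisted is not shuffled again. The function name records_shuffled and the record counts are hypothetical, chosen only for illustration.

```python
def records_shuffled(base_size, other_sizes, pre_partition):
    """Toy cost model: count records moved by shuffles across several joins.

    Assumptions (not measured on a real cluster):
      * joining an RDD that has no partitioner shuffles all of its records;
      * an RDD that was hash-partitioned and persisted is not shuffled again
        when a later join reuses that partitioning.
    """
    if pre_partition:
        # partitionBy shuffles the base once; each join then moves only the other side.
        return base_size + sum(other_sizes)
    # Without pre-partitioning, every join shuffles the base and the other side.
    return base_size * len(other_sizes) + sum(other_sizes)


income_size = 1_000_000  # hypothetical record counts
country_size = 200

# One join: the shuffle is only moved to a different place.
one_join_plain = records_shuffled(income_size, [country_size], pre_partition=False)
one_join_pre = records_shuffled(income_size, [country_size], pre_partition=True)

# Three joins that reuse income: the pre-partitioned version shuffles income once.
three_joins_plain = records_shuffled(income_size, [country_size] * 3, pre_partition=False)
three_joins_pre = records_shuffled(income_size, [country_size] * 3, pre_partition=True)

print(one_join_plain, one_join_pre)
print(three_joins_plain, three_joins_pre)
```

Under these assumptions the two variants cost the same for a single join, and the pre-partitioned variant only pulls ahead once the persisted, partitioned income data is reused by more than one join.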
