What does df.repartition with no column arguments partition on?


Question


In PySpark the repartition method has an optional columns argument which will of course repartition your DataFrame by that key.


My question is - how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):
    """
    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The
    resulting DataFrame is hash partitioned.

    :param numPartitions:
        can be an int to specify the target number of partitions or a Column.
        If it is a Column, it will be used as the first partitioning column. If not specified,
        the default number of partitions is used.

    .. versionchanged:: 1.6
       Added optional arguments to specify the partitioning columns. Also made numPartitions
       optional if partitioning columns are specified.

    >>> df.repartition(10).rdd.getNumPartitions()
    10
    >>> data = df.union(df).repartition("age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  5|  Bob|
    |  5|  Bob|
    |  2|Alice|
    |  2|Alice|
    +---+-----+
    >>> data = data.repartition(7, "age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    >>> data.rdd.getNumPartitions()
    7
    """
    if isinstance(numPartitions, int):
        if len(cols) == 0:
            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)
        else:
            return DataFrame(
                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)
    elif isinstance(numPartitions, (basestring, Column)):
        cols = (numPartitions, ) + cols
        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)
    else:
        raise TypeError("numPartitions should be an int or Column")


For example: it's totally fine to call these lines, but I have no idea what it's actually doing. Is it a hash of the entire row? Perhaps the first column in the DataFrame?

df_2 = df_1\
       .where(sf.col('some_column') == 1)\
       .repartition(32)\
       .alias('df_2')

Answer


By default, if no partitioner is specified, the partitioning is not based on any characteristic of the data; rows are simply distributed across the nodes in a random and uniform way. (In recent Spark versions, repartition(n) with no columns uses round-robin partitioning, which spreads rows evenly without looking at their values.)
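A toy sketch in plain Python (not Spark's actual implementation) of what round-robin distribution means: the target partition depends only on each row's position, never on its contents, so the partitions come out evenly sized.

```python
# Toy illustration (not Spark internals): distributing rows round-robin,
# ignoring their contents, yields evenly sized partitions.
def round_robin_partition(rows, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        # The target partition depends only on the row's position,
        # not on any key or column value.
        partitions[i % num_partitions].append(row)
    return partitions

rows = [("Alice", 2), ("Bob", 5), ("Alice", 2), ("Bob", 5), ("Carol", 9)]
parts = round_robin_partition(rows, 2)
# Partition sizes differ by at most one.
print([len(p) for p in parts])  # [3, 2]
```

Note how duplicate rows like ("Alice", 2) can land in different partitions, whereas hash partitioning by a column would always send equal keys to the same partition.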


The repartition algorithm behind df.repartition does a full data shuffle and distributes the data equally among the partitions. If you only need to reduce the number of partitions, it is better to use df.coalesce, which avoids a full shuffle.
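A toy contrast in plain Python (not Spark internals) between a repartition-style full shuffle, where every row may move, and a coalesce-style merge, which only concatenates whole existing partitions:

```python
# Toy illustration (not Spark internals) of why coalesce is cheaper:
# it merges whole partitions instead of reshuffling every row.
def full_shuffle(partitions, num_partitions):
    # repartition-style: every row may move to a new partition.
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        out[i % num_partitions].append(row)
    return out

def coalesce(partitions, num_partitions):
    # coalesce-style: existing partitions are grouped into
    # num_partitions buckets; rows never leave their original group.
    out = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        out[i % num_partitions].extend(part)
    return out

parts = [[1, 2], [3], [4, 5, 6], [7]]
print(full_shuffle(parts, 2))  # [[1, 3, 5, 7], [2, 4, 6]]
print(coalesce(parts, 2))      # [[1, 2, 4, 5, 6], [3, 7]]
```

Note the trade-off visible even in this sketch: the coalesce-style merge can leave partitions uneven (5 rows vs 2 here), which is the price of avoiding a full shuffle.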


Here is a good explanation of how to manage partitions with DataFrames: https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

