What does df.repartition with no column arguments partition on?

Question

In PySpark the repartition method has an optional columns argument, which will of course repartition your dataframe by that key.

My question is - how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):
    """
    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The
    resulting DataFrame is hash partitioned.

    :param numPartitions:
        can be an int to specify the target number of partitions or a Column.
        If it is a Column, it will be used as the first partitioning column. If not specified,
        the default number of partitions is used.

    .. versionchanged:: 1.6
       Added optional arguments to specify the partitioning columns. Also made numPartitions
       optional if partitioning columns are specified.

    >>> df.repartition(10).rdd.getNumPartitions()
    10
    >>> data = df.union(df).repartition("age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  5|  Bob|
    |  5|  Bob|
    |  2|Alice|
    |  2|Alice|
    +---+-----+
    >>> data = data.repartition(7, "age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    >>> data.rdd.getNumPartitions()
    7
    """
    if isinstance(numPartitions, int):
        if len(cols) == 0:
            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)
        else:
            return DataFrame(
                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)
    elif isinstance(numPartitions, (basestring, Column)):
        cols = (numPartitions, ) + cols
        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)
    else:
        raise TypeError("numPartitions should be an int or Column")
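The docstring above says the result is hash partitioned when columns are given. The idea can be sketched in plain Python (a simplification: Spark actually hashes the column expressions with Murmur3 on the JVM side, not with Python's built-in hash):

```python
def hash_partition(rows, key_index, num_partitions):
    """Place each row in the partition chosen by hashing its key,
    so rows with equal keys always land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        p = hash(row[key_index]) % num_partitions
        partitions[p].append(row)
    return partitions

# Mirrors the df.union(df).repartition("age") example above:
rows = [(5, "Bob"), (5, "Bob"), (2, "Alice"), (2, "Alice")]
parts = hash_partition(rows, 0, 7)
# Both age=5 rows share one partition; both age=2 rows share another.
```

This is what makes key-based repartitioning useful before joins and aggregations: all rows with the same key are co-located. The question is what happens when there is no key at all.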

For example: it's totally fine to call these lines but I have no idea what it's actually doing. Is it a hash of the entire line? Perhaps the first column in the dataframe?

df_2 = df_1\
       .where(sf.col('some_column') == 1)\
       .repartition(32)\
       .alias('df_2')

Answer

By default, if no partitioner is specified, the partitioning is not based on any characteristic of the data; rows are simply distributed across the nodes in a random, uniform way (recent Spark versions implement this as round-robin partitioning).
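A minimal pure-Python sketch of that idea (not Spark's actual implementation, which also shuffles the rows across executors over the network):

```python
def round_robin_partition(rows, num_partitions):
    """Distribute rows across partitions in round-robin order,
    ignoring their content entirely: no key, no hash."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions

rows = [("Alice", 2), ("Bob", 5), ("Alice", 2), ("Bob", 5), ("Carol", 9)]
parts = round_robin_partition(rows, 3)
# Partition sizes differ by at most one row, regardless of the data:
print([len(p) for p in parts])  # [2, 2, 1]
```

This is why `repartition(32)` with no columns is fine to call: it balances partition sizes, but gives no guarantee about which partition any particular row ends up in.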

The repartition algorithm behind df.repartition does a full data shuffle and distributes the data equally among the partitions. To reduce shuffling, it is better to use df.coalesce when you only need to lower the partition count.
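The difference can be sketched in plain Python (hypothetical helper names; Spark's real coalesce merges partitions with locality in mind to avoid a network shuffle, rather than the simple modulo grouping used here):

```python
def repartition(partitions, n):
    """Full shuffle: pool every row, then redistribute round-robin.
    Every row may move, but the result is evenly balanced."""
    all_rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(all_rows):
        out[i % n].append(row)
    return out

def coalesce(partitions, n):
    """No full shuffle: merge existing partitions into n groups.
    Cheaper, but any existing size skew is preserved."""
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

parts = [[1, 2, 3, 4], [5], [6], [7]]           # 4 skewed partitions
print([len(p) for p in repartition(parts, 2)])  # [4, 3] - balanced
print([len(p) for p in coalesce(parts, 2)])     # [5, 2] - skew kept
```

So coalesce is the cheaper choice when shrinking the partition count after a filter, while repartition is worth the shuffle cost when the partitions have become badly unbalanced.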

Here is a good explanation of how to repartition DataFrames: https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
