What does df.repartition with no column arguments partition on?


Question


In PySpark the repartition method has an optional columns argument which will of course repartition your DataFrame by that key.


My question is - how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.

def repartition(self, numPartitions, *cols):
    """
    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The
    resulting DataFrame is hash partitioned.

    :param numPartitions:
        can be an int to specify the target number of partitions or a Column.
        If it is a Column, it will be used as the first partitioning column. If not specified,
        the default number of partitions is used.

    .. versionchanged:: 1.6
       Added optional arguments to specify the partitioning columns. Also made numPartitions
       optional if partitioning columns are specified.

    >>> df.repartition(10).rdd.getNumPartitions()
    10
    >>> data = df.union(df).repartition("age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  5|  Bob|
    |  5|  Bob|
    |  2|Alice|
    |  2|Alice|
    +---+-----+
    >>> data = data.repartition(7, "age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    >>> data.rdd.getNumPartitions()
    7
    """
    if isinstance(numPartitions, int):
        if len(cols) == 0:
            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)
        else:
            return DataFrame(
                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)
    elif isinstance(numPartitions, (basestring, Column)):
        cols = (numPartitions, ) + cols
        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)
    else:
        raise TypeError("numPartitions should be an int or Column")


For example: it's totally fine to call these lines, but I have no idea what it's actually doing. Is it a hash of the entire row? Perhaps the first column in the DataFrame?

df_2 = df_1\
       .where(sf.col('some_column') == 1)\
       .repartition(32)\
       .alias('df_2')

Answer


By default, if no partitioner is specified, the partitioning is not based on any characteristic of the data; rows are simply distributed across the nodes in a random and uniform way. (In recent Spark versions, repartition(n) with no columns uses round-robin partitioning, which spreads rows evenly without looking at their values.)
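A toy sketch in plain Python (not Spark's actual implementation) of what round-robin distribution means: the target partition depends only on each row's position, never on its contents, so the partitions come out evenly sized.

```python
# Toy illustration (not Spark internals): distributing rows round-robin,
# ignoring their contents, yields evenly sized partitions.
def round_robin_partition(rows, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        # The target partition depends only on the row's position,
        # not on any key or column value.
        partitions[i % num_partitions].append(row)
    return partitions

rows = [("Alice", 2), ("Bob", 5), ("Alice", 2), ("Bob", 5), ("Carol", 9)]
parts = round_robin_partition(rows, 2)
# Partition sizes differ by at most one.
print([len(p) for p in parts])  # [3, 2]
```

Note how duplicate rows like ("Alice", 2) can land in different partitions, whereas hash partitioning by a column would always send equal keys to the same partition.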


The repartition algorithm behind df.repartition does a full data shuffle and distributes the data equally among the partitions. If you only need to reduce the number of partitions, it is better to use df.coalesce, which avoids a full shuffle.
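A toy contrast in plain Python (not Spark internals) between a repartition-style full shuffle, where every row may move, and a coalesce-style merge, which only concatenates whole existing partitions:

```python
# Toy illustration (not Spark internals) of why coalesce is cheaper:
# it merges whole partitions instead of reshuffling every row.
def full_shuffle(partitions, num_partitions):
    # repartition-style: every row may move to a new partition.
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        out[i % num_partitions].append(row)
    return out

def coalesce(partitions, num_partitions):
    # coalesce-style: existing partitions are grouped into
    # num_partitions buckets; rows never leave their original group.
    out = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        out[i % num_partitions].extend(part)
    return out

parts = [[1, 2], [3], [4, 5, 6], [7]]
print(full_shuffle(parts, 2))  # [[1, 3, 5, 7], [2, 4, 6]]
print(coalesce(parts, 2))      # [[1, 2, 4, 5, 6], [3, 7]]
```

Note the trade-off visible even in this sketch: the coalesce-style merge can leave partitions uneven (5 rows vs 2 here), which is the price of avoiding a full shuffle.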


Here is a good explanation of how to manage partitions with DataFrames: https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

