What does df.repartition with no column arguments partition on?
Question
In PySpark, the repartition method has an optional columns argument which will, of course, repartition your DataFrame by that key.
My question is: how does Spark repartition when there's no key? I couldn't dig any further into the source code to find where this goes through Spark itself.
def repartition(self, numPartitions, *cols):
    """
    Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The
    resulting DataFrame is hash partitioned.

    :param numPartitions:
        can be an int to specify the target number of partitions or a Column.
        If it is a Column, it will be used as the first partitioning column. If not specified,
        the default number of partitions is used.

    .. versionchanged:: 1.6
       Added optional arguments to specify the partitioning columns. Also made numPartitions
       optional if partitioning columns are specified.

    >>> df.repartition(10).rdd.getNumPartitions()
    10
    >>> data = df.union(df).repartition("age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  5|  Bob|
    |  5|  Bob|
    |  2|Alice|
    |  2|Alice|
    +---+-----+
    >>> data = data.repartition(7, "age")
    >>> data.show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|
    |  5|  Bob|
    |  2|Alice|
    |  5|  Bob|
    +---+-----+
    >>> data.rdd.getNumPartitions()
    7
    """
    if isinstance(numPartitions, int):
        if len(cols) == 0:
            return DataFrame(self._jdf.repartition(numPartitions), self.sql_ctx)
        else:
            return DataFrame(
                self._jdf.repartition(numPartitions, self._jcols(*cols)), self.sql_ctx)
    elif isinstance(numPartitions, (basestring, Column)):
        cols = (numPartitions, ) + cols
        return DataFrame(self._jdf.repartition(self._jcols(*cols)), self.sql_ctx)
    else:
        raise TypeError("numPartitions should be an int or Column")
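For the branch where partitioning columns are given, the docstring's "hash partitioned" means each row's target partition is derived from a hash of the key columns. A minimal plain-Python sketch of that idea (Spark's real implementation hashes with Murmur3 on the JVM side; the built-in hash() merely stands in for it here):

```python
# Plain-Python sketch of hash partitioning by a key column.
# NOTE: illustrative only -- Spark uses a Murmur3 hash, not Python's hash().
rows = [(5, "Bob"), (5, "Bob"), (2, "Alice"), (2, "Alice")]
num_partitions = 7

partitions = [[] for _ in range(num_partitions)]
for age, name in rows:
    # Every row with the same key value lands in the same partition.
    partitions[hash(age) % num_partitions].append((age, name))
```

Because the partition index depends only on the key, rows sharing a key are guaranteed to end up together, which is what makes a subsequent groupBy or join on that key cheap.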
For example: it's totally fine to call these lines, but I have no idea what it's actually doing. Is it a hash of the entire row? Perhaps the first column in the DataFrame?
df_2 = df_1\
    .where(sf.col('some_column') == 1)\
    .repartition(32)\
    .alias('df_2')
Recommended answer
By default, if no partitioner is specified, the partitioning is not based on any characteristic of the data; rows are simply distributed across the nodes in a random and uniform way.
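More concretely, in recent Spark versions the physical plan for repartition(n) with no columns shows RoundRobinPartitioning: rows are dealt out to the n partitions in turn, which is what makes the distribution uniform regardless of the rows' values. A plain-Python sketch of that idea (illustrative only, not Spark's actual code):

```python
# Round-robin sketch: deal rows out to partitions in turn, ignoring content.
rows = list(range(10))   # stand-in for the DataFrame's rows
num_partitions = 3

partitions = [[] for _ in range(num_partitions)]
for i, row in enumerate(rows):
    partitions[i % num_partitions].append(row)

# Partition sizes differ by at most one, no matter what the rows contain.
sizes = [len(p) for p in partitions]
```

You can confirm this on a real DataFrame by inspecting df.repartition(32).explain(), which should show the round-robin exchange in the plan.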
The repartition algorithm behind df.repartition does a full data shuffle and distributes the data equally among the partitions. To reduce shuffling, it is better to use df.coalesce.
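The difference can be sketched in plain Python: coalesce only merges whole existing partitions into fewer groups (Spark additionally prefers merging partitions that live on the same executor, which this sketch ignores), while repartition redistributes every individual row. These helpers are a hypothetical illustration, not Spark's implementation:

```python
# Illustrative contrast between coalesce and repartition semantics.

def coalesce(partitions, n):
    """Merge existing partitions into n groups without a full shuffle:
    each input partition moves wholesale into a target group."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)   # whole partitions move; rows never split up
    return merged

def repartition(partitions, n):
    """Full shuffle: every individual row is redistributed round-robin."""
    all_rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(all_rows):
        out[i % n].append(row)
    return out

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))     # partitions merged in place, no row movement
print(repartition(parts, 3))  # every row reassigned
```

This is why coalesce can only decrease the partition count, and why repartition is the one that can fix skew: only a full shuffle rebalances rows across all partitions.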
Here is a good explanation of how to repartition a DataFrame:
https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4