Spark DataFrame partitioner is None
Question
[New to Spark]
After creating a DataFrame, I am trying to partition it based on a column in the DataFrame. When I check the partitioner using data_frame.rdd.partitioner, I get None as output.
Using:
data_frame.repartition("column_name")
As per the Spark documentation the default partitioner is HashPartitioner; how can I confirm that?
Also, how can I change the partitioner?
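A minimal sketch that reproduces the observation (Scala here, to match the answer below; the local master setting and column name are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioner-check")
  .master("local[*]") // illustrative; use your own cluster settings
  .getOrCreate()
import spark.implicits._

// Build a small DataFrame and repartition it by a column.
val dataFrame = spark.range(100).select(($"id" % 3).as("id")).repartition($"id")

// The RDD view of a Dataset reports no partitioner, even after repartition.
println(dataFrame.rdd.partitioner) // None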
Answer
That's to be expected. An RDD converted from a Dataset doesn't preserve the partitioner, only the data distribution.
If you want to inspect the partitioner of the RDD, you should retrieve it from the queryExecution:
scala> val df = spark.range(100).select($"id" % 3 as "id").repartition(42, $"id")
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]
scala> df.queryExecution.toRdd.partitioner
res1: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.sql.execution.CoalescedPartitioner@4be2340e)
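For contrast, inspecting the partitioner through the plain RDD conversion of the same Dataset yields None, which is exactly the behaviour described in the question (the res number below is illustrative):

scala> df.rdd.partitioner
res2: Option[org.apache.spark.Partitioner] = None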
How can I change the partitioner?
In general you cannot. There is a repartitionByRange method (see the linked thread), but otherwise the Dataset Partitioner is not configurable.
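For completeness, a minimal sketch of calling repartitionByRange on the same Dataset (the partition count of 10 is illustrative; the method exists as of Spark 2.3):

scala> val byRange = df.repartitionByRange(10, $"id")
byRange: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: bigint]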