Does Spark know the partitioning key of a DataFrame?
I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.
Context:
Running Spark 2.0.1 with a local SparkSession. I have a CSV dataset that I am saving as a parquet file on disk like so:
val df0 = spark
.read
.format("csv")
.option("header", true)
.option("delimiter", ";")
.option("inferSchema", false)
  .load("SomeFile.csv")
val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)
df.write
.mode(SaveMode.Overwrite)
.format("parquet")
.option("inferSchema", false)
.save("SomeFile.parquet")
I am creating 42 partitions by column numerocarte. This should group multiple numerocarte values into the same partition. I don't want to do partitionBy("numerocarte") at write time because I don't want one partition per card; there would be millions of them.
After that, in another script, I read this SomeFile.parquet file and do some operations on it. In particular, I run a window function on it, where the partitioning is done on the same column the parquet file was repartitioned by.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df2 = spark.read
.format("parquet")
.option("header", true)
.option("inferSchema", false)
.load("SomeFile.parquet")
val w = Window.partitionBy(col("numerocarte"))
.orderBy(col("SomeColumn"))
df2.withColumn("NewColumnName",
  sum(col("dollars")).over(w))
After the read, I can see that the repartition worked as expected and DataFrame df2 has 42 partitions, each containing a different set of cards.
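This kind of check can be done from the underlying RDD; a minimal sketch, assuming the df2 loaded above:

```scala
// Number of partitions of df2 right after the read;
// in the scenario above this prints 42.
val numParts = df2.rdd.getNumPartitions
println(s"df2 has $numParts partitions")
```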
Questions:
- Does Spark know that the dataframe df2 is partitioned by column numerocarte?
- If it knows, then there will be no shuffle in the window function. True?
- If it does not know, it will do a shuffle in the window function. True?
- If it does not know, how do I tell Spark the data is already partitioned by the right column?
- How can I check the partitioning key of a DataFrame? Is there a command for this? I know how to check the number of partitions, but how do I see the partitioning key?
- When I print the number of partitions after each step, I have 42 partitions after read and 200 partitions after withColumn, which suggests that Spark repartitioned my DataFrame.
- If I have two different tables repartitioned on the same column, would the join use that information?
Does Spark know that the dataframe df2 is partitioned by column numerocarte?
It does not.
If it does not know, how do I tell Spark the data is already partitioned by the right column?
You don't. Just because you save data that has been shuffled does not mean it will be loaded with the same splits.
How can I check a partitioning key of DataFrame?
There is no partitioning key once you have loaded the data, but you can check queryExecution for the Partitioner.
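For example, a sketch assuming the df2 from the question (outputPartitioning belongs to Spark's internal physical-plan API, so it is not a stable interface):

```scala
import org.apache.spark.sql.functions.col

// Inspect how the physical plan believes the data is distributed.
// A DataFrame freshly loaded from parquet reports UnknownPartitioning,
// even though it was repartitioned before being written:
println(df2.queryExecution.executedPlan.outputPartitioning)

// After an explicit repartition, the plan reports
// hashpartitioning(numerocarte, 42) instead:
val df3 = df2.repartition(42, col("numerocarte"))
println(df3.queryExecution.executedPlan.outputPartitioning)
```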
In practice:
- If you want to support efficient pushdowns on the key, use the partitionBy method of DataFrameWriter.
- If you want limited support for join optimizations, use bucketBy with the metastore and persistent tables.
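Sketches of both options follow; the output path and table name are hypothetical, and bucketed tables must be written via saveAsTable, i.e. through the metastore:

```scala
// 1. Hive-style directory partitioning: one directory per key value,
//    which enables partition pruning (filter pushdown on the key).
df.write
  .partitionBy("numerocarte")
  .parquet("cards_partitioned.parquet")

// 2. Bucketing: a fixed number of buckets recorded in the metastore,
//    which join and aggregation planning can use to avoid shuffles.
df.write
  .bucketBy(42, "numerocarte")
  .sortBy("SomeColumn")
  .saveAsTable("cards_bucketed")
```

Two tables bucketed the same way on the join key can then be joined without shuffling either side on that key.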
See How to define partitioning of DataFrame? for detailed examples.