Does Spark know the partitioning key of a DataFrame?


Question


I want to know if Spark knows the partitioning key of the parquet file and uses this information to avoid shuffles.

Context:

Running Spark 2.0.1 with a local SparkSession. I have a csv dataset that I am saving as a parquet file on my disk like so:

val df0 = spark
  .read
  .format("csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("inferSchema", false)
  .load("SomeFile.csv")


val df = df0.repartition(partitionExprs = col("numerocarte"), numPartitions = 42)

df.write
  .mode(SaveMode.Overwrite)
  .format("parquet")
  .option("inferSchema", false)
  .save("SomeFile.parquet")

I am creating 42 partitions by column numerocarte. This should group multiple numerocarte values into the same partition. I don't want to do partitionBy("numerocarte") at write time because I don't want one partition per card; there would be millions of them.

After that, in another script, I read this SomeFile.parquet file and do some operations on it. In particular, I am running a window function on it where the partitioning is done on the same column that the parquet file was repartitioned by.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df2 = spark.read
  .format("parquet")
  .option("header", true)
  .option("inferSchema", false)
  .load("SomeFile.parquet")

val w = Window.partitionBy(col("numerocarte"))
  .orderBy(col("SomeColumn"))

df2.withColumn("NewColumnName",
      sum(col("dollars")).over(w))

After the read I can see that the repartition worked as expected: DataFrame df2 has 42 partitions, each containing different cards.
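A quick way to verify this (a sketch; column and variable names follow the question's setup, and spark_partition_id is used to see which partition each row landed in):

```scala
import org.apache.spark.sql.functions.{spark_partition_id, countDistinct, col}

// Number of partitions after the read -- 42 in the question's setup
println(df2.rdd.getNumPartitions)

// How many distinct cards ended up in each partition
df2.groupBy(spark_partition_id().as("pid"))
  .agg(countDistinct(col("numerocarte")).as("cards"))
  .orderBy("pid")
  .show()
```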

Questions:

  1. Does Spark know that the dataframe df2 is partitioned by column numerocarte?
  2. If it knows, then there will be no shuffle in the window function. True?
  3. If it does not know, it will do a shuffle in the window function. True?
  4. If it does not know, how do I tell Spark the data is already partitioned by the right column?
  5. How can I check the partitioning key of a DataFrame? Is there a command for this? I know how to check the number of partitions, but how do I see the partitioning key?
  6. When I print the number of partitions after each step, I have 42 partitions after read and 200 partitions after withColumn, which suggests that Spark repartitioned my DataFrame.
  7. If I have two different tables repartitioned on the same column, would the join use that information?

Solution

Does Spark know that the dataframe df2 is partitioned by column numerocarte?

It does not.

If it does not know, how do I tell Spark the data is already partitioned by the right column?

You don't. Just because you save data that has been shuffled does not mean it will be loaded with the same splits.

How can I check the partitioning key of a DataFrame?

There is no partitioning key once you have loaded the data, but you can inspect queryExecution for the Partitioner.
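One way to inspect this (a sketch against the Spark 2.x Scala API; df2 and the column name follow the question) is through the underlying RDD's partitioner and the physical plan's output partitioning:

```scala
import org.apache.spark.sql.functions.col

// After loading from parquet, no Partitioner survives the write/read cycle:
println(df2.rdd.partitioner)  // None

// After an explicit repartition, the hash partitioning is visible in the plan:
val repartitioned = df2.repartition(42, col("numerocarte"))
println(repartitioned.queryExecution.executedPlan.outputPartitioning)
// e.g. hashpartitioning(numerocarte#..., 42)
```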


In practice:

  • If you want to support efficient pushdowns on the key, use the partitionBy method of DataFrameWriter.
  • If you want limited support for join optimizations, use bucketBy with a metastore and persistent tables.
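A minimal sketch of the bucketBy option (table names and the bucket count are illustrative; bucketBy requires saveAsTable with a metastore, e.g. Hive support enabled):

```scala
import org.apache.spark.sql.SaveMode

// Write each side bucketed by the join key into a persistent table
df.write
  .bucketBy(42, "numerocarte")
  .sortBy("numerocarte")
  .mode(SaveMode.Overwrite)
  .saveAsTable("cards_bucketed")

// When both join sides are bucketed the same way on the join key,
// Spark can avoid the shuffle (exchange) for that join
val joined = spark.table("cards_bucketed")
  .join(spark.table("other_bucketed"), "numerocarte")
```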

See How to define partitioning of DataFrame? for detailed examples.
