How to determine partition key/column with Spark
Question
Suppose I use partitionBy to save some data to disk, e.g. by date, so my data looks like this:
/mydata/d=01-01-2018/part-00000
/mydata/d=01-01-2018/part-00001
...
/mydata/d=02-01-2018/part-00000
/mydata/d=02-01-2018/part-00001
...
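For context, a minimal sketch of how an upstream job might produce that layout (the input path and file format here are assumptions, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
// Hypothetical upstream writer: partitionBy("d") creates one sub-directory
// per distinct date, i.e. /mydata/d=01-01-2018/part-*, /mydata/d=02-01-2018/...
spark.read.parquet("/some/input") // illustrative source, not from the question
  .write
  .partitionBy("d")
  .parquet("/mydata")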
When I read the data using the Hive config and the DataFrame API, like so:
val df = sparkSession.sql(s"select * from $database.$tableName")
I know that:

- Filter queries on column d will be pushed down
- No shuffles will occur if I try to partition by d (e.g. GROUP BY d); both cases are sketched below
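Concretely, the two cases above look like this (a sketch against the df defined earlier; column and path names come from the example layout):

import sparkSession.implicits._

// Partition filter: Spark should prune to the /mydata/d=01-01-2018/ directory
val oneDay = df.filter($"d" === "01-01-2018")

// The grouping case from the second bullet
val perDay = df.groupBy($"d").count()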
BUT, suppose I don't know what the partition key is (some upstream job writes the data, and has no conventions). How can I get Spark to tell me which is the partition key, in this case d? Similarly, what if we have multiple partition levels (e.g. by month, week, then day)?
Currently the best code we have is really ugly:
def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
  // DESCRIBE lists the data columns first, then a "# Partition Information"
  // section whose "# col_name" header row precedes the partition columns
  val cols = sparkSession.sql(s"desc $databaseTableName")
    .select("col_name")
    .collect
    .map(_.getAs[String](0))
    .dropWhile(r => !r.matches("# col_name"))

  if (cols.isEmpty) Set() else cols.tail.toSet
}
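For reference, this scrapes DESCRIBE output that looks roughly like the following for the example table (illustrative, with one assumed data column x); everything up to the # col_name marker under # Partition Information is dropped, and the remaining rows are taken as the partition columns:

col_name                data_type    comment
x                       string       null
d                       string       null
# Partition Information
# col_name              data_type    comment
d                       string       null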
Answer
Assuming you don't have = and / in your partitioned column values, you can do:
import org.apache.spark.sql.AnalysisException
import spark.implicits._ // encoder for the Dataset[String] produced by map

val partitionedCols: Set[String] = try {
  // rows look like "d=01-01-2018", or "month=01/day=01" for nested partitions;
  // the query itself must sit inside the try: it fails for unpartitioned tables
  spark.sql("show partitions database.test_table")
    .map(_.getAs[String](0)).first
    .split('/').map(_.split("=")(0)).toSet
} catch {
  case e: AnalysisException => Set.empty[String]
}
You should get a Set[String] with the partitioned column names.
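Wrapped into a drop-in replacement for the helper in the question (a sketch under the same assumption about = and / in the values):

import org.apache.spark.sql.{AnalysisException, SparkSession}

def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
  import sparkSession.implicits._
  try {
    sparkSession.sql(s"show partitions $databaseTableName")
      .map(_.getAs[String](0)).first
      .split('/').map(_.split("=")(0)).toSet
  } catch {
    case _: AnalysisException => Set.empty[String] // table is not partitioned
  }
}

// e.g. getPartitionColumnsForHiveTable("database.test_table") == Set("d")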