如何使用Spark确定分区键/列 [英] How to determine partition key/column with Spark

查看:141
本文介绍了如何使用Spark确定分区键/列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设我使用partitionBy将一些数据保存到磁盘,例如按日期,所以我的数据如下:

Suppose I use partitionBy to save some data to disk, e.g. by date so my data looks like this:

/mydata/d=01-01-2018/part-00000
/mydata/d=01-01-2018/part-00001
...
/mydata/d=02-01-2018/part-00000
/mydata/d=02-01-2018/part-00001
...

当我使用Hive config和DataFrame读取数据时,

When I read the data using Hive config and DataFrame, so

val df = sparkSession.sql(s"select * from $database.$tableName")

我知道:

  • d列的过滤器查询将下推
  • 如果我尝试按d进行分区(例如GROUP BY d)
  • ,则不会发生任何改组
  • Filter queries on column d will push down
  • No shuffles will occur if I try to partition by d (e.g. GROUP BY d)

但是,假设我不知道分区键是什么(某些上游作业会写入数据,并且没有约定).如何让Spark告诉我哪个是分区键,在本例中为d.同样,如果我们有多个分区(例如按月,周,日).

BUT, suppose I don't know what the partition key is (some upstream job writes the data, and has no conventions). How can I get Spark to tell me which is the partition key, in this case d. Similarly if we have multiple partitions (e.g. by month, week, then day).

目前我们拥有的最好的代码真的很丑:

Currently the best code we have is really ugly:

def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
    val cols = sparkSession.
      sql(s"desc $databaseTableName")
      .select("col_name")
      .collect
      .map(_.getAs[String](0))
      .dropWhile(r => !r.matches("# col_name"))
    if (cols.isEmpty) {
      Set()
    } else {
      cols.tail.toSet
    }
  }

推荐答案

假定分区列值中没有=/,则可以执行以下操作:

Assuming you don't have = and / in your partitioned column values, you can do:

val df = spark.sql("show partitions database.test_table")

val partitionedCols: Set[String] = try { 
  df.map(_.getAs[String](0)).first.split('/').map(_.split("=")(0)).toSet
} catch {
  case e: AnalysisException => Set.empty[String]
}

您应该获得带有分区列名称的Array[String].

You should get an Array[String] with the partitioned column names.

这篇关于如何使用Spark确定分区键/列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆