How to determine partition key/column with Spark
Question
Suppose I use partitionBy to save some data to disk, e.g. by date, so my data looks like this:
/mydata/d=01-01-2018/part-00000
/mydata/d=01-01-2018/part-00001
...
/mydata/d=02-01-2018/part-00000
/mydata/d=02-01-2018/part-00001
...
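For context, a minimal sketch of how an upstream job might produce that layout (the input path and file format here are assumptions, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
// Hypothetical upstream writer: partitionBy("d") creates one sub-directory
// per distinct date, i.e. /mydata/d=01-01-2018/part-*, /mydata/d=02-01-2018/...
spark.read.parquet("/some/input") // illustrative source, not from the question
  .write
  .partitionBy("d")
  .parquet("/mydata")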
When I read the data using the Hive config and the DataFrame API, like so:
val df = sparkSession.sql(s"select * from $database.$tableName")
I know that:

- Filter queries on column d will be pushed down
- No shuffles will occur if I try to partition by d (e.g. GROUP BY d); both cases are sketched below
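Concretely, the two cases above look like this (a sketch against the df defined earlier; column and path names come from the example layout):

import sparkSession.implicits._

// Partition filter: Spark should prune to the /mydata/d=01-01-2018/ directory
val oneDay = df.filter($"d" === "01-01-2018")

// The grouping case from the second bullet
val perDay = df.groupBy($"d").count()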
BUT, suppose I don't know what the partition key is (some upstream job writes the data, and has no conventions). How can I get Spark to tell me which is the partition key, in this case d? Similarly, what if we have multiple partition levels (e.g. by month, week, then day)?
Currently the best code we have is really ugly:
def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
  // DESCRIBE lists the data columns first, then a "# Partition Information"
  // section whose "# col_name" header row precedes the partition columns
  val cols = sparkSession.sql(s"desc $databaseTableName")
    .select("col_name")
    .collect
    .map(_.getAs[String](0))
    .dropWhile(r => !r.matches("# col_name"))

  if (cols.isEmpty) Set() else cols.tail.toSet
}
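For reference, this scrapes DESCRIBE output that looks roughly like the following for the example table (illustrative, with one assumed data column x); everything up to the # col_name marker under # Partition Information is dropped, and the remaining rows are taken as the partition columns:

col_name                data_type    comment
x                       string       null
d                       string       null
# Partition Information
# col_name              data_type    comment
d                       string       null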
Answer
Assuming you don't have = and / in your partitioned column values, you can do:
import org.apache.spark.sql.AnalysisException
import spark.implicits._ // encoder for the Dataset[String] produced by map

val partitionedCols: Set[String] = try {
  // rows look like "d=01-01-2018", or "month=01/day=01" for nested partitions;
  // the query itself must sit inside the try: it fails for unpartitioned tables
  spark.sql("show partitions database.test_table")
    .map(_.getAs[String](0)).first
    .split('/').map(_.split("=")(0)).toSet
} catch {
  case e: AnalysisException => Set.empty[String]
}
You should get a Set[String] with the partitioned column names.
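Wrapped into a drop-in replacement for the helper in the question (a sketch under the same assumption about = and / in the values):

import org.apache.spark.sql.{AnalysisException, SparkSession}

def getPartitionColumnsForHiveTable(databaseTableName: String)(implicit sparkSession: SparkSession): Set[String] = {
  import sparkSession.implicits._
  try {
    sparkSession.sql(s"show partitions $databaseTableName")
      .map(_.getAs[String](0)).first
      .split('/').map(_.split("=")(0)).toSet
  } catch {
    case _: AnalysisException => Set.empty[String] // table is not partitioned
  }
}

// e.g. getPartitionColumnsForHiveTable("database.test_table") == Set("d")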