How to extract the latest/most recent partition from a list of year/month/day partition columns
Problem description
I ran show partitions in Spark SQL, which gives me the following:
year=2019/month=1/day=21
year=2019/month=1/day=22
year=2019/month=1/day=23
year=2019/month=1/day=24
year=2019/month=1/day=25
year=2019/month=1/day=26
year=2019/month=2/day=27
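- I need to extract the latest partition.
- I need the year, month and day as separate values so I can use them as variables in another dataframe, i.e.: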
part_year=2019
part_month=1
part_day=29
I tried:
val overwrite2 = overwrite.select(col("partition",8,8) as year
From this I get:
2019/month
To strip the trailing "/month" I use another dataframe, where I apply regexp_replace
to replace "month" with a blank, so yet another dataframe is created.
This in turn creates a lot of overhead. What I want is for all of these steps to happen in a single dataframe, so that the resulting dataframe gives me:
part_year=2019
part_month=2
part_day=27
That is, the latest partition is the one selected.
Recommended answer
Question: How to extract the latest/most recent partition from a list of year/month/day partition columns
1) I need to extract the latest partition.
2) I need the year, month and day separately so I can use them in another dataframe as variables.
- Since the final goal is to get the latest/most recent partition, you can parse each partition string into a Joda-Time DateTime and sort with isAfter to find the latest one, as in the example below.
After spark.sql(s"show partitions $yourtablename") you get a dataframe. Collect it to the driver; since the partition list is small, this is not a problem.
Once you have collected the partitions dataframe, you get an array like this:
import org.joda.time.DateTime

// Partition strings as collected from "show partitions"
val x = Array(
  "year=2019/month=1/day=21",
  "year=2019/month=1/day=22",
  "year=2019/month=1/day=23",
  "year=2019/month=1/day=24",
  "year=2019/month=1/day=25",
  "year=2019/month=1/day=26",
  "year=2019/month=2/day=27"
)

// Parse every partition string into a DateTime keyed by "businessdate"
def listKeys(): Seq[Map[String, DateTime]] = {
  val keys: Seq[DateTime] = x.map { row =>
    println(s" Identified Key: $row")
    // "year=2019/month=1/day=21" becomes "2019-1-21" before parsing
    DateTime.parse(
      row.replaceAll("/", "")
        .replaceAll("year=", "")
        .replaceAll("month=", "-")
        .replaceAll("day=", "-")
    )
  }.toSeq
  println(keys)
  println(s"Fetched ${keys.size} ")
  keys.map(key => Map("businessdate" -> key))
}

// Defined after listKeys so the snippet also works when pasted line by line
val finalPartitions = listKeys()

// Sort newest first and take the head
val mapWithMostRecentBusinessDate = finalPartitions.sortWith(
  (a, b) => a("businessdate").isAfter(b("businessdate"))
).head
println(mapWithMostRecentBusinessDate)

val latest: Option[DateTime] = mapWithMostRecentBusinessDate.get("businessdate")
val year = latest.get.getYear
val month = latest.get.getMonthOfYear
val day = latest.get.getDayOfMonth
println("latest year " + year + " latest month " + month + " latest day " + day)
Final result: your most recent date is 2019-02-27.
Based on this you can now query the Hive data in an optimized way.
Identified Key: year=2019/month=1/day=21
Identified Key: year=2019/month=1/day=22
Identified Key: year=2019/month=1/day=23
Identified Key: year=2019/month=1/day=24
Identified Key: year=2019/month=1/day=25
Identified Key: year=2019/month=1/day=26
Identified Key: year=2019/month=2/day=27
WrappedArray(2019-01-21T00:00:00.000-06:00, 2019-01-22T00:00:00.000-06:00, 2019-01-23T00:00:00.000-06:00, 2019-01-24T00:00:00.000-06:00, 2019-01-25T00:00:00.000-06:00, 2019-01-26T00:00:00.000-06:00, 2019-02-27T00:00:00.000-06:00)
Fetched 7
Map(businessdate -> 2019-02-27T00:00:00.000-06:00)
latest year 2019 latest month 2 latest day 27
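If Joda-Time is not on the classpath, the same idea can be sketched with plain Scala: parse each partition string into a (year, month, day) tuple and take the maximum, since tuples of Ints compare element by element. This is only a minimal sketch under the assumption that every partition string has the exact shape year=…/month=…/day=…; the names LatestPartition, parse, latest and PartitionPattern are hypothetical and not part of the original answer.

```scala
object LatestPartition {
  // Assumed shape of a partition string: "year=2019/month=1/day=21"
  private val PartitionPattern = """year=(\d+)/month=(\d+)/day=(\d+)""".r

  // Returns (year, month, day) for a well-formed partition string, None otherwise
  def parse(partition: String): Option[(Int, Int, Int)] = partition match {
    case PartitionPattern(y, m, d) => Some((y.toInt, m.toInt, d.toInt))
    case _                         => None
  }

  // Tuples are ordered by year, then month, then day, so max is the latest date
  def latest(partitions: Seq[String]): Option[(Int, Int, Int)] = {
    val parsed = partitions.flatMap(parse)
    if (parsed.isEmpty) None else Some(parsed.max)
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(
      "year=2019/month=1/day=21",
      "year=2019/month=1/day=26",
      "year=2019/month=2/day=27"
    )
    latest(x).foreach { case (partYear, partMonth, partDay) =>
      println(s"part_year=$partYear part_month=$partMonth part_day=$partDay")
      // The three values can then be used as a partition filter on another dataframe
      println(s"year = $partYear AND month = $partMonth AND day = $partDay")
    }
  }
}
```

For the sample list this prints part_year=2019 part_month=2 part_day=27 followed by the filter string. Malformed partition strings are skipped rather than raising an error, which may or may not be what you want.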