How to extract the latest/recent partition from the list of year month day partition columns


Question

I ran show partitions in Spark SQL, which gives me the following:

year=2019/month=1/day=21
year=2019/month=1/day=22
year=2019/month=1/day=23
year=2019/month=1/day=24
year=2019/month=1/day=25
year=2019/month=1/day=26
year=2019/month=2/day=27

  1. I need to extract the latest partition.
  2. I need the year, month and day separately, so I can use them as variables in another dataframe, i.e.:

part_year=2019
part_month=1
part_day=29 

I have used:

val overwrite2 = overwrite.select(col("partition",8,8) as year

from which I get

2019/month

To remove this, I use another dataframe in which I apply regexp_replace to replace month with a blank, so yet another dataframe is created.

This in turn creates a lot of overhead. What I want is for all these steps to be done in one dataframe, so that I get the resulting dataframe as:

part_year=2019
part_month=2
part_day=27

with the latest partition selected.

Recommended answer

Question: How to extract the latest/recent partition from the list of year month day partition columns

1) I need to extract the latest partition.

2) I need the year, month and day separately so I can use them in another dataframe as variables.

  • Since the final goal is to get the latest/recent partition, you can parse each partition into a Joda-Time DateTime and sort with isAfter to find the latest one, as in the example below.
  • After spark.sql(s"show Partitions $yourtablename") you get a dataframe; collect it. Since the partition list is small, collecting it to the driver is not a problem.

      Once you collect the partitions dataframe, you will get an array like this:

        import org.joda.time.DateTime

        // Partition strings as collected from the show partitions result
        val x = Array(
          "year=2019/month=1/day=21",
          "year=2019/month=1/day=22",
          "year=2019/month=1/day=23",
          "year=2019/month=1/day=24",
          "year=2019/month=1/day=25",
          "year=2019/month=1/day=26",
          "year=2019/month=2/day=27"
        )

        // Parse every partition string into a DateTime keyed by "businessdate"
        def listKeys(): Seq[Map[String, DateTime]] = {
          val keys: Seq[DateTime] = x.map { row =>
            println(s" Identified Key: $row")
            // "year=2019/month=1/day=21" becomes "2019-1-21" before parsing
            DateTime.parse(row.replaceAll("/", "")
              .replaceAll("year=", "")
              .replaceAll("month=", "-")
              .replaceAll("day=", "-")
            )
          }.toSeq
          println(keys)
          println(s"Fetched ${keys.size} ")
          keys.map(key => Map("businessdate" -> key))
        }

        // Defined after listKeys() so the snippet also works in a REPL or script
        val finalPartitions = listKeys()

        // Sort newest first and take the head, i.e. the most recent partition
        val mapWithMostRecentBusinessDate = finalPartitions.sortWith(
          (a, b) => a("businessdate").isAfter(b("businessdate"))
        ).head

        println(mapWithMostRecentBusinessDate)
        val latest: DateTime = mapWithMostRecentBusinessDate("businessdate")
        val year = latest.getYear
        val month = latest.getMonthOfYear
        val day = latest.getDayOfMonth
        println("latest year "+ year + "  latest month " + month + "  latest day  " + day)
      

      Final result: your most recent date is 2019-02-27. Based on this, you can now query the Hive data in an optimized way.

       Identified Key: year=2019/month=1/day=21
       Identified Key: year=2019/month=1/day=22
       Identified Key: year=2019/month=1/day=23
       Identified Key: year=2019/month=1/day=24
       Identified Key: year=2019/month=1/day=25
       Identified Key: year=2019/month=1/day=26
       Identified Key: year=2019/month=2/day=27
      WrappedArray(2019-01-21T00:00:00.000-06:00, 2019-01-22T00:00:00.000-06:00, 2019-01-23T00:00:00.000-06:00, 2019-01-24T00:00:00.000-06:00, 2019-01-25T00:00:00.000-06:00, 2019-01-26T00:00:00.000-06:00, 2019-02-27T00:00:00.000-06:00)
      Fetched 7 
      Map(businessdate -> 2019-02-27T00:00:00.000-06:00)
      latest year 2019  latest month 2  latest day  27
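
If adding Joda-Time is not desirable, the same idea can be sketched in plain Scala. This is only an illustration, not part of the original answer: it assumes every partition string has exactly the form year=Y/month=M/day=D, and the names PartitionPattern and latestPartition are made up for this example. It can be pasted into spark-shell or the Scala REPL.

```scala
// Hypothetical helper names; assumes Hive-style strings like "year=2019/month=2/day=27".
val PartitionPattern = """year=(\d+)/month=(\d+)/day=(\d+)""".r

// Returns the latest (year, month, day), or None if no string matches the pattern.
def latestPartition(partitions: Seq[String]): Option[(Int, Int, Int)] = {
  // Strings that do not match the pattern are silently skipped.
  val parsed = partitions.collect {
    case PartitionPattern(y, m, d) => (y.toInt, m.toInt, d.toInt)
  }
  // Tuples are ordered element by element, so max is the most recent date.
  if (parsed.isEmpty) None else Some(parsed.max)
}

val partitions = Seq(
  "year=2019/month=1/day=21",
  "year=2019/month=1/day=26",
  "year=2019/month=2/day=27"
)

val latest = latestPartition(partitions)
latest.foreach { case (partYear, partMonth, partDay) =>
  println(s"part_year=$partYear part_month=$partMonth part_day=$partDay")
}
// prints: part_year=2019 part_month=2 part_day=27
```

Because the values are compared as integers, month=10 correctly sorts after month=2, which a plain string sort of the partition names would get wrong. In a real job the input could presumably come from spark.sql(s"show Partitions $yourtablename").collect().map(_.getString(0)), since show partitions returns a single string column.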
      
