Fix query to resolve to_char and/or string comparison issue in Scala Databricks 2.4.3


Problem description

I've processed a parquet file and created the following data frame in Scala Spark 2.4.3.

+-----------+------------+-----------+--------------+-----------+
|     itemno|requestMonth|requestYear|totalRequested|requestDate|
+-----------+------------+-----------+--------------+-----------+
|    7512365|           2|       2014|         110.0| 2014-02-01|
|    7519278|           4|       2013|          96.0| 2013-04-01|
|5436134-070|          12|       2013|           8.0| 2013-12-01|
|    7547385|           1|       2014|          89.0| 2014-01-01|
|    0453978|           9|       2014|          18.0| 2014-09-01|
|    7558402|          10|       2014|         260.0| 2014-10-01|
|5437662-070|           7|       2013|          78.0| 2013-07-01|
|    3089858|          11|       2014|           5.0| 2014-11-01|
|    7181584|           2|       2017|           4.0| 2017-02-01|
|    7081417|           3|       2017|          15.0| 2017-03-01|
|    5814215|           4|       2017|          35.0| 2017-04-01|
|    7178940|          10|       2014|           5.0| 2014-10-01|
|    0450636|           1|       2015|           7.0| 2015-01-01|
|    5133406|           5|       2014|          46.0| 2014-05-01|
|    2204858|          12|       2015|          34.0| 2015-12-01|
|    1824299|           5|       2015|           1.0| 2015-05-01|
|5437474-620|           8|       2015|           4.0| 2015-08-01|
|    3086317|           9|       2014|           1.0| 2014-09-01|
|    2204331|           3|       2015|           2.0| 2015-03-01|
|    5334160|           1|       2018|           2.0| 2018-01-01|
+-----------+------------+-----------+--------------+-----------+

To derive a new feature, I am trying to apply some logic and rearrange the data frame as follows:

itemno - as in the above-mentioned data frame

startDate - the start of the season

endDate - the end of the season

totalRequested - number of parts requested in that season

percentageOfRequests - totalRequested in the current season / total over this plus the 3 previous seasons (4 seasons in total)

// season dates for reference
Spring: 1 March to 31 May.

Summer: 1 June to 31 August.

Autumn: 1 September to 30 November.

Winter: 1 December to 28 February.

What I did:

I tried the following two approaches:

case
    when to_char(StartDate,'MMDD') between '0301' and '0531' then 'spring'
    .....
    .....
end as season

but it didn't work. The to_char logic worked when I ran it in an Oracle DB, but after looking around I found that Spark SQL doesn't have this function. Also, I tried

import org.apache.spark.sql.functions._

// builds MMDD strings like "3-1" or "11-30" from unpadded month and day numbers
val dateDF1 = orvPartRequestsDF.withColumn("MMDD", concat_ws("-", month($"requestDate"), dayofmonth($"requestDate")))

%sql
select distinct requestDate, MMDD,
case
    when MMDD between '3-1' and '5-31' then 'Spring'
    when MMDD between '6-1' and '8-31' then 'Summer'
    when MMDD between '9-1' and '11-30' then 'Autumn'
    when MMDD between '12-1' and '2-28' then 'Winter'
end as season
from temporal

and it also didn't work. Could you please let me know what I am missing here (my guess is that I can't compare strings like this, but I am not sure, so I am asking here) and how I can solve this?
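A note on why both attempts fail: month() and dayofmonth() return unpadded numbers, so the concatenated MMDD strings compare lexicographically ("12-1" sorts before "3-1") and BETWEEN gives wrong answers, while a zero-padded MMdd string does order correctly within a year. A minimal Scala sketch of that idea (dateDF2 is an illustrative name; the frame and column come from the snippets above):

import org.apache.spark.sql.functions._

// date_format is Spark's closest analogue to Oracle's to_char for dates:
// the pattern "MMdd" yields zero-padded strings such as "0301", which sort correctly.
val dateDF2 = orvPartRequestsDF.withColumn("MMDD", date_format(col("requestDate"), "MMdd"))

With zero-padded values, BETWEEN '0301' AND '0531' behaves as intended for spring, summer and autumn; winter still needs two ranges ('1201'..'1231' and '0101'..'0229') because it wraps the year boundary, which is how the answer below handles it.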

After JXC's solution #1

Since I was seeing some discrepancy, I am sharing the data frame again. The following is the dataframe seasonDF12:

+-------+-----------+--------------+------+----------+
| itemno|requestYear|totalRequested|season|seasonCalc|
+-------+-----------+--------------+------+----------+
|0450000|       2011|           0.0|Winter|    201075|
|0450000|       2011|           0.0|Winter|    201075|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Winter|    201175|
|0450000|       2012|           3.0|Winter|    201175|
|0450000|       2012|           1.0|Winter|    201175|
|0450000|       2012|           4.0|Spring|    201200|
|0450000|       2012|           0.0|Spring|    201200|
|0450000|       2012|           0.0|Spring|    201200|
|0450000|       2012|           2.0|Summer|    201225|
|0450000|       2012|           3.0|Summer|    201225|
|0450000|       2012|           2.0|Summer|    201225|
+-------+-----------+--------------+------+----------+

I am applying

val seasonDF2 = seasonDF12.selectExpr("*", """
    sum(totalRequested) OVER (
        PARTITION BY itemno
        ORDER BY seasonCalc
        RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS sum_totalRequested
""")

and I see the following result:

[screenshot of the output omitted]

Look at the first 40 in the sum_totalRequested column: all the entries above it are 0, and I'm not sure why it's 40. I think I already shared this, but I need the above dataframe to be transformed into

itemno  startDateOfSeason  endDateOfSeason  sum_totalRequestedBySeason  (totalRequested in current season / totalRequested in the last 3 + current seasons)

The final output would look like:

itemno  startDateOfSeason  endDateOfSeason  season  sum_totalRequestedBySeason  ratio (current / current + previous 3 seasons)
123     12/01/2018         02/28/2019       winter  12                          12 / (12 + 36)  (36 from the previous three seasons)
123     03/01/2019         05/31/2019       spring  24                          24 / (24 + 45)  (45 from the previous three seasons)

Answer

Edit-2: adjusted to calculate the group-by-seasons sum first and then the Window aggregate sum:

Edit-1: Based on the comments, the named season is not required. We can set Spring, Summer, Autumn and Winter to 0, 25, 50 and 75 respectively, and make the season an integer added to year(requestDate)*100, so that we can use rangeBetween (offset = -100 for the current plus the previous 3 seasons) in Window aggregate functions:
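For example, spring 2014 gets label 2014*100 + 0 = 201400 and summer 2014 gets 201425, while a request in January 2015 belongs to the winter that started in December 2014 and therefore gets (2015-1)*100 + 75 = 201475. Consecutive season labels thus differ by exactly 25, even across the year boundary (201475 is followed by 201500).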

Note: below is PySpark code:

df.createOrReplaceTempView("df_table")

df1 = spark.sql("""
    WITH t1 AS ( SELECT *
                 , year(requestDate) as YY
                 , date_format(requestDate, "MMdd") as MMDD 
                 FROM df_table )
       , t2 AS ( SELECT *,
       CASE
         WHEN MMDD BETWEEN '0301' AND '0531' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', YY, '03-01'))
             , 'endDateOfSeason', date(concat_ws('-', YY, '05-31'))
             , 'season', 'spring'
             , 'label', int(YY)*100
           )
         WHEN MMDD BETWEEN '0601' AND '0831' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', YY, '06-01'))
             , 'endDateOfSeason', date(concat_ws('-', YY, '08-31'))
             , 'season', 'summer'
             , 'label', int(YY)*100 + 25
           )
         WHEN MMDD BETWEEN '0901' AND '1130' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', YY, '09-01'))
             , 'endDateOfSeason', date(concat_ws('-', YY, '11-30'))
             , 'season', 'autumn'
             , 'label', int(YY)*100 + 50
           )
         WHEN MMDD BETWEEN '1201' AND '1231' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', YY, '12-01'))
             , 'endDateOfSeason', last_day(concat_ws('-', int(YY)+1, '02-28'))
             , 'season', 'winter'
             , 'label', int(YY)*100 + 75
           )
         WHEN MMDD BETWEEN '0101' AND '0229' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', int(YY)-1, '12-01'))
             , 'endDateOfSeason', last_day(concat_ws('-', YY, '02-28'))
             , 'season', 'winter'
             , 'label', (int(YY)-1)*100 + 75
           )
       END AS seasons
       FROM t1
  )
  SELECT itemno
  ,      seasons.*
  ,      sum(totalRequested) AS sum_totalRequestedBySeason
  FROM t2
  GROUP BY itemno, seasons

   """)

This yields the following result:

df1.show()
+-----------+-----------------+---------------+------+------+--------------------------+
|     itemno|startDateOfSeason|endDateOfSeason|season| label|sum_totalRequestedBySeason|
+-----------+-----------------+---------------+------+------+--------------------------+
|5436134-070|       2013-12-01|     2013-12-31|winter|201375|                       8.0|
|    1824299|       2015-03-01|     2015-05-31|spring|201500|                       1.0|
|    0453978|       2014-09-01|     2014-11-30|autumn|201450|                      18.0|
|    7181584|       2017-01-01|     2017-02-28|winter|201675|                       4.0|
|    7178940|       2014-09-01|     2014-11-30|autumn|201450|                       5.0|
|    7547385|       2014-01-01|     2014-02-28|winter|201375|                      89.0|
|    5814215|       2017-03-01|     2017-05-31|spring|201700|                      35.0|
|    3086317|       2014-09-01|     2014-11-30|autumn|201450|                       1.0|
|    0450636|       2015-01-01|     2015-02-28|winter|201475|                       7.0|
|    2204331|       2015-03-01|     2015-05-31|spring|201500|                       2.0|
|5437474-620|       2015-06-01|     2015-08-31|summer|201525|                       4.0|
|    5133406|       2014-03-01|     2014-05-31|spring|201400|                      46.0|
|    7081417|       2017-03-01|     2017-05-31|spring|201700|                      15.0|
|    7519278|       2013-03-01|     2013-05-31|spring|201300|                      96.0|
|    7558402|       2014-09-01|     2014-11-30|autumn|201450|                     260.0|
|    2204858|       2015-12-01|     2015-12-31|winter|201575|                      34.0|
|5437662-070|       2013-06-01|     2013-08-31|summer|201325|                      78.0|
|    5334160|       2018-01-01|     2018-02-28|winter|201775|                       2.0|
|    3089858|       2014-09-01|     2014-11-30|autumn|201450|                       5.0|
|    7512365|       2014-01-01|     2014-02-28|winter|201375|                     110.0|
+-----------+-----------------+---------------+------+------+--------------------------+

After we have the season totals, calculate the sum of the current plus the previous 3 seasons using a Window aggregate function, and then the ratio:

df1.selectExpr("*", """
    round(sum_totalRequestedBySeason/sum(sum_totalRequestedBySeason) OVER (
        PARTITION BY itemno
        ORDER BY label
        RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
    ),2) AS ratio_of_current_over_current_plus_past_3_seasons
""").show()

