Fix query to resolve to_char and/or string comparison issue in Scala Databricks 2.4.3


Problem Description


I've processed a parquet file and created the following data frame in Scala Spark 2.4.3.

+-----------+------------+-----------+--------------+-----------+
|     itemno|requestMonth|requestYear|totalRequested|requestDate|
+-----------+------------+-----------+--------------+-----------+
|    7512365|           2|       2014|         110.0| 2014-02-01|
|    7519278|           4|       2013|          96.0| 2013-04-01|
|5436134-070|          12|       2013|           8.0| 2013-12-01|
|    7547385|           1|       2014|          89.0| 2014-01-01|
|    0453978|           9|       2014|          18.0| 2014-09-01|
|    7558402|          10|       2014|         260.0| 2014-10-01|
|5437662-070|           7|       2013|          78.0| 2013-07-01|
|    3089858|          11|       2014|           5.0| 2014-11-01|
|    7181584|           2|       2017|           4.0| 2017-02-01|
|    7081417|           3|       2017|          15.0| 2017-03-01|
|    5814215|           4|       2017|          35.0| 2017-04-01|
|    7178940|          10|       2014|           5.0| 2014-10-01|
|    0450636|           1|       2015|           7.0| 2015-01-01|
|    5133406|           5|       2014|          46.0| 2014-05-01|
|    2204858|          12|       2015|          34.0| 2015-12-01|
|    1824299|           5|       2015|           1.0| 2015-05-01|
|5437474-620|           8|       2015|           4.0| 2015-08-01|
|    3086317|           9|       2014|           1.0| 2014-09-01|
|    2204331|           3|       2015|           2.0| 2015-03-01|
|    5334160|           1|       2018|           2.0| 2018-01-01|
+-----------+------------+-----------+--------------+-----------+


To derive a new feature, I am trying to apply logic and rearrange the data frame as follows:

itemno - as it is in the above-mentioned data frame

startDate - the start of the season

endDate - the end of the season

totalRequested - number of parts requested in that season

percentageOfRequests - totalRequested in the current season / total over this plus the 3 previous seasons (4 seasons total)

// season dates, for reference
Spring: 1 March to 31 May.

Summer: 1 June to 31 August.

Autumn: 1 September to 30 November.

Winter: 1 December to 28 February.

What I did:

I tried the following two approaches.

case
    when to_char(StartDate, 'MMDD') between '0301' and '0531' then 'spring'
    .....
    .....
end as season


but it didn't work. The same to_char logic worked for me in Oracle DB, but after looking around I found that Spark SQL doesn't have this function. I also tried:

import org.apache.spark.sql.functions._

val dateDF1 = orvPartRequestsDF.withColumn("MMDD", concat_ws("-", month($"requestDate"), dayofmonth($"requestDate")))

%sql
select distinct requestDate, MMDD, 
case 
           when MMDD between '3-1' and '5-31' then 'Spring' 
           when MMDD between '6-1' and '8-31' then 'Summer' 
           when MMDD between '9-1' and '11-30' then 'Autumn' 
           when MMDD between '12-1' and '2-28' then 'Winter'
 end as season
from temporal


and it also didn't work. Could you please let me know what I am missing here (my guess is that I can't compare strings like this, but I am not sure, so I am asking here) and how I can solve this?
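For what it's worth, that guess is right: BETWEEN on strings is lexicographic, and concat_ws with month() and dayofmonth() produces unpadded values such as "3-1", which as a string sorts after "10-5", so the ranges don't behave like dates. Below is a minimal Scala sketch of a zero-padded alternative built on date_format (the same function the accepted answer uses); orvPartRequestsDF and requestDate are the names from the question, the winter branch is collapsed into otherwise for brevity, and, as in the snippet above, a notebook where spark.implicits._ is in scope is assumed.

import org.apache.spark.sql.functions._

// date_format zero-pads month and day (e.g. "0301"), so string BETWEEN
// orders the same way as the underlying month/day values.
val dateDF2 = orvPartRequestsDF
  .withColumn("MMDD", date_format($"requestDate", "MMdd"))
  .withColumn("season",
    when($"MMDD".between("0301", "0531"), "Spring")
      .when($"MMDD".between("0601", "0831"), "Summer")
      .when($"MMDD".between("0901", "1130"), "Autumn")
      .otherwise("Winter")) // remaining dates: Dec, Jan, Feb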

After JXC's solution #1 (rangeBetween)


Since I was seeing some discrepancy, I am sharing the data frame again. Following is the dataframe seasonDF12:

+-------+-----------+--------------+------+----------+
| itemno|requestYear|totalRequested|season|seasonCalc|
+-------+-----------+--------------+------+----------+
|0450000|       2011|           0.0|Winter|    201075|
|0450000|       2011|           0.0|Winter|    201075|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Spring|    201100|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Summer|    201125|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Autumn|    201150|
|0450000|       2011|           0.0|Winter|    201175|
|0450000|       2012|           3.0|Winter|    201175|
|0450000|       2012|           1.0|Winter|    201175|
|0450000|       2012|           4.0|Spring|    201200|
|0450000|       2012|           0.0|Spring|    201200|
|0450000|       2012|           0.0|Spring|    201200|
|0450000|       2012|           2.0|Summer|    201225|
|0450000|       2012|           3.0|Summer|    201225|
|0450000|       2012|           2.0|Summer|    201225|
+-------+-----------+--------------+------+----------+

What I am applying:

val seasonDF2 = seasonDF12.selectExpr("*", """
    sum(totalRequested) OVER (
        PARTITION BY itemno
        ORDER BY seasonCalc
        RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
    ) AS sum_totalRequested
""")

and I am seeing unexpected results: looking at the first 40 in the sum_totalRequested column, all the entries above it are 0, and I am not sure why it's 40. I think I already shared it, but I need the above dataframe to be transformed into:

itemno  startDateOfSeason  endDateOfSeason   sum_totalRequestedBySeason  (totalrequestedinCurrentSeason/totalRequestedinlast 3 + current season.)

The final output will look like this:

itemno  startDateOfSeason  endDateOfSeason     season      sum_totalRequestedBySeason  (totalrequestedinCurrentSeason/totalRequestedinlast 3 + current season.)
123     12/01/2018          02/28/2019         winter       12                          12/12+ 36 (36 from previous three seasons)
123     03/01/2019          05/31/2019         spring       24                          24/24 + 45 (45 from previous three seasons)

Answer


Edit-2: adjusted to calculate the sums grouped by season first, and then the Window aggregate sum (a row-level RANGE window treats every row with the same label as a frame peer, so it is cleaner to reduce to one row per season before windowing):


Edit-1: Based on the comments, the named season is not required. We can map Spring, Summer, Autumn and Winter to 0, 25, 50 and 75 respectively, so each season becomes an integer label of year(requestDate)*100 plus that offset, and we can then use rangeBetween (offset = -100 for the current plus the previous 3 seasons) in Window aggregate functions:
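To make the label arithmetic concrete, here is a minimal sketch of the same encoding as a plain Scala function (a hypothetical helper for illustration only, not part of the answer's pipeline); note that December belongs to the current year's winter while January/February belong to the previous year's:

import java.time.LocalDate

// Season label = year*100 + offset (spring = 0, summer = 25, autumn = 50, winter = 75).
// Consecutive seasons are exactly 25 apart, even across year boundaries:
// winter 2013 = 201375, spring 2014 = 201400.
def seasonLabel(d: LocalDate): Int = d.getMonthValue match {
  case 3 | 4 | 5   => d.getYear * 100            // spring
  case 6 | 7 | 8   => d.getYear * 100 + 25       // summer
  case 9 | 10 | 11 => d.getYear * 100 + 50       // autumn
  case 12          => d.getYear * 100 + 75       // winter, December side
  case _           => (d.getYear - 1) * 100 + 75 // winter, Jan/Feb side
}

// seasonLabel(LocalDate.of(2013, 12, 15))  ==> 201375
// seasonLabel(LocalDate.of(2014, 1, 20))   ==> 201375
// seasonLabel(LocalDate.of(2014, 4, 10))   ==> 201400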

Note: the following is pyspark code:

df.createOrReplaceTempView("df_table")

df1 = spark.sql("""
    WITH t1 AS ( SELECT *
                 , year(requestDate) as YY
                 , date_format(requestDate, "MMdd") as MMDD 
                 FROM df_table )
       , t2 AS ( SELECT *,
       CASE
         WHEN MMDD BETWEEN '0301' AND '0531' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', YY, '03-01'))
             , 'endDateOfSeason', date(concat_ws('-', YY, '05-31'))
             , 'season', 'spring'
             , 'label', int(YY)*100
           )
         WHEN MMDD BETWEEN '0601' AND '0831' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', YY, '06-01'))
             , 'endDateOfSeason', date(concat_ws('-', YY, '08-31'))
             , 'season', 'summer'
             , 'label', int(YY)*100 + 25
           )
         WHEN MMDD BETWEEN '0901' AND '1130' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', YY, '09-01'))
             , 'endDateOfSeason', date(concat_ws('-', YY, '11-30'))
             , 'season', 'autumn'
             , 'label', int(YY)*100 + 50
           )
         WHEN MMDD BETWEEN '1201' AND '1231' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', YY, '12-01'))
             , 'endDateOfSeason', last_day(concat_ws('-', int(YY)+1, '02-28'))
             , 'season', 'winter'
             , 'label', int(YY)*100 + 75
           )
         WHEN MMDD BETWEEN '0101' AND '0229' THEN
           named_struct(
               'startDateOfSeason', date(concat_ws('-', int(YY)-1, '12-01'))
             , 'endDateOfSeason', last_day(concat_ws('-', YY, '02-28'))
             , 'season', 'winter'
             , 'label', (int(YY)-1)*100 + 75
           )
       END AS seasons
       FROM t1
  )
  SELECT itemno
  ,      seasons.*
  ,      sum(totalRequested) AS sum_totalRequestedBySeason
  FROM t2
  GROUP BY itemno, seasons
""")

This gives the following result:

df1.show()
+-----------+-----------------+---------------+------+------+--------------------------+
|     itemno|startDateOfSeason|endDateOfSeason|season| label|sum_totalRequestedBySeason|
+-----------+-----------------+---------------+------+------+--------------------------+
|5436134-070|       2013-12-01|     2013-12-31|winter|201375|                       8.0|
|    1824299|       2015-03-01|     2015-05-31|spring|201500|                       1.0|
|    0453978|       2014-09-01|     2014-11-30|autumn|201450|                      18.0|
|    7181584|       2017-01-01|     2017-02-28|winter|201675|                       4.0|
|    7178940|       2014-09-01|     2014-11-30|autumn|201450|                       5.0|
|    7547385|       2014-01-01|     2014-02-28|winter|201375|                      89.0|
|    5814215|       2017-03-01|     2017-05-31|spring|201700|                      35.0|
|    3086317|       2014-09-01|     2014-11-30|autumn|201450|                       1.0|
|    0450636|       2015-01-01|     2015-02-28|winter|201475|                       7.0|
|    2204331|       2015-03-01|     2015-05-31|spring|201500|                       2.0|
|5437474-620|       2015-06-01|     2015-08-31|summer|201525|                       4.0|
|    5133406|       2014-03-01|     2014-05-31|spring|201400|                      46.0|
|    7081417|       2017-03-01|     2017-05-31|spring|201700|                      15.0|
|    7519278|       2013-03-01|     2013-05-31|spring|201300|                      96.0|
|    7558402|       2014-09-01|     2014-11-30|autumn|201450|                     260.0|
|    2204858|       2015-12-01|     2015-12-31|winter|201575|                      34.0|
|5437662-070|       2013-06-01|     2013-08-31|summer|201325|                      78.0|
|    5334160|       2018-01-01|     2018-02-28|winter|201775|                       2.0|
|    3089858|       2014-09-01|     2014-11-30|autumn|201450|                       5.0|
|    7512365|       2014-01-01|     2014-02-28|winter|201375|                     110.0|
+-----------+-----------------+---------------+------+------+--------------------------+


After we have the season totals, calculate the sum over the current plus previous 3 seasons using a Window aggregate function, and then the ratio:

df1.selectExpr("*", """
    round(sum_totalRequestedBySeason/sum(sum_totalRequestedBySeason) OVER (
        PARTITION BY itemno
        ORDER BY label
        RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
    ),2) AS ratio_of_current_over_current_plus_past_3_seasons
""").show()
