Fix query to resolve to_char and/or string comparison issue in Scala Databricks 2.4.3
Question
I've processed a parquet file and created the following data frame in Scala Spark 2.4.3.
+-----------+------------+-----------+--------------+-----------+
| itemno|requestMonth|requestYear|totalRequested|requestDate|
+-----------+------------+-----------+--------------+-----------+
| 7512365| 2| 2014| 110.0| 2014-02-01|
| 7519278| 4| 2013| 96.0| 2013-04-01|
|5436134-070| 12| 2013| 8.0| 2013-12-01|
| 7547385| 1| 2014| 89.0| 2014-01-01|
| 0453978| 9| 2014| 18.0| 2014-09-01|
| 7558402| 10| 2014| 260.0| 2014-10-01|
|5437662-070| 7| 2013| 78.0| 2013-07-01|
| 3089858| 11| 2014| 5.0| 2014-11-01|
| 7181584| 2| 2017| 4.0| 2017-02-01|
| 7081417| 3| 2017| 15.0| 2017-03-01|
| 5814215| 4| 2017| 35.0| 2017-04-01|
| 7178940| 10| 2014| 5.0| 2014-10-01|
| 0450636| 1| 2015| 7.0| 2015-01-01|
| 5133406| 5| 2014| 46.0| 2014-05-01|
| 2204858| 12| 2015| 34.0| 2015-12-01|
| 1824299| 5| 2015| 1.0| 2015-05-01|
|5437474-620| 8| 2015| 4.0| 2015-08-01|
| 3086317| 9| 2014| 1.0| 2014-09-01|
| 2204331| 3| 2015| 2.0| 2015-03-01|
| 5334160| 1| 2018| 2.0| 2018-01-01|
+-----------+------------+-----------+--------------+-----------+
To derive a new feature, I am trying to apply some logic and rearrange the data frame as follows:
itemno - as it is in the above-mentioned data frame
startDate - the start of the season
endDate - the end of the season
totalRequested - number of parts requested in that season
percentageOfRequests - totalRequested in the current season / total over this plus the 3 previous seasons (4 seasons total)
// season dates for reference
Spring: 1 March to 31 May.
Summer: 1 June to 31 August.
Autumn: 1 September to 30 November.
Winter: 1 December to 28 February.
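As a quick sanity check of these boundaries, a plain-Python helper (hypothetical, not part of the question's code) can classify a date into these seasons:

```python
from datetime import date

# Hypothetical helper, not from the question: classify a date into the
# seasons listed above. December, January and February are all Winter.
def season_of(d: date) -> str:
    if 3 <= d.month <= 5:
        return "Spring"
    if 6 <= d.month <= 8:
        return "Summer"
    if 9 <= d.month <= 11:
        return "Autumn"
    return "Winter"

print(season_of(date(2014, 2, 1)))   # Winter
print(season_of(date(2014, 5, 31)))  # Spring
print(season_of(date(2014, 9, 1)))   # Autumn
```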
What I did:
I tried the following two approaches. First,
case
when to_char(StartDate,'MMDD') between '0301' and '0531' then 'spring'
.....
.....
end as season
but it didn't work. I used the to_char logic in an Oracle DB and it worked there, but after looking around I found that Spark SQL doesn't have this function. Also, I tried
import org.apache.spark.sql.functions._
val dateDF1 = orvPartRequestsDF.withColumn("MMDD", concat_ws("-", month($"requestDate"), dayofmonth($"requestDate")))
%sql
select distinct requestDate, MMDD,
case
when MMDD between '3-1' and '5-31' then 'Spring'
when MMDD between '6-1' and '8-31' then 'Summer'
when MMDD between '9-1' and '11-30' then 'Autumn'
when MMDD between '12-1' and '2-28' then 'Winter'
end as season
from temporal
and it also didn't work. Could you please let me know what I am missing here (my guess is that I can't compare strings like this, but I am not sure, so I am asking) and how I can solve this?
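The failure is reproducible with plain string comparisons: BETWEEN on strings is lexicographic, character by character, not chronological, so the unpadded "M-D" values sort in the wrong order. Zero-padded month-day strings, which is what Oracle's to_char(date, 'MMDD') produced and what Spark's date_format(requestDate, 'MMdd') produces, compare correctly:

```python
# Unpadded "M-D" strings compare lexicographically, not chronologically:
print("10-1" >= "9-1")   # False: '1' < '9', so every October date drops out of Autumn
print("5-5" <= "5-31")   # False: at the third character, '5' > '3'

# Zero-padded MMdd strings (e.g. from date_format(requestDate, 'MMdd')) behave:
print("1001" >= "0901")  # True
print("0505" <= "0531")  # True
```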
After JXC's solution #1
Since I was seeing some discrepancy, I am sharing the data frame again. The following is the dataframe seasonDF12:
+-------+-----------+--------------+------+----------+
| itemno|requestYear|totalRequested|season|seasonCalc|
+-------+-----------+--------------+------+----------+
|0450000| 2011| 0.0|Winter| 201075|
|0450000| 2011| 0.0|Winter| 201075|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Winter| 201175|
|0450000| 2012| 3.0|Winter| 201175|
|0450000| 2012| 1.0|Winter| 201175|
|0450000| 2012| 4.0|Spring| 201200|
|0450000| 2012| 0.0|Spring| 201200|
|0450000| 2012| 0.0|Spring| 201200|
|0450000| 2012| 2.0|Summer| 201225|
|0450000| 2012| 3.0|Summer| 201225|
|0450000| 2012| 2.0|Summer| 201225|
+-------+-----------+--------------+------+----------+
and I am applying
val seasonDF2 = seasonDF12.selectExpr("*", """
sum(totalRequested) OVER (
PARTITION BY itemno
ORDER BY seasonCalc
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
) AS sum_totalRequested
""")
and I am seeing the following: look at the first 40 in the sum_totalRequested column. All the entries above it are 0, so I am not sure why it's 40. I think I already shared it, but I need the above dataframe to be transformed into
itemno startDateOfSeason endDateOfSeason sum_totalRequestedBySeason (totalrequestedinCurrentSeason/totalRequestedinlast 3 + current season.)
The final output looks like:
itemno startDateOfSeason endDateOfSeason season sum_totalRequestedBySeason (totalrequestedinCurrentSeason/totalRequestedinlast 3 + current season.)
123 12/01/2018 02/28/2019 winter 12 12/12+ 36 (36 from previous three seasons)
123 03/01/2019 05/31/2019 spring 24 24/24 + 45 (45 from previous three seasons)
Answer
Edit-2: adjusted to calculate the sum grouped by seasons first and then the Window aggregate sum:
Edit-1: Based on the comments, the named season is not required. We can set Spring, Summer, Autumn and Winter to 0, 25, 50 and 75 respectively, and the season then becomes an integer added to year(requestDate)*100, so that we can use rangeBetween in Window aggregate functions (offset = -100 for the current plus the previous 3 seasons):
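This label encoding can be sanity-checked in plain Python (a sketch; the label helper below is a hypothetical name, not part of the answer's code). Consecutive seasons differ by exactly 25, so a range of 100 preceding always reaches back exactly three seasons, even across year boundaries:

```python
# Season labels: year*100 + offset, with Spring=0, Summer=25, Autumn=50, Winter=75.
# Consecutive seasons differ by 25, so RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
# covers the current season plus the three before it (4 * 25 = 100).
OFFSET = {"spring": 0, "summer": 25, "autumn": 50, "winter": 75}

def label(year: int, season: str) -> int:
    return year * 100 + OFFSET[season]

labels = [label(2013, "winter"), label(2014, "spring"),
          label(2014, "summer"), label(2014, "autumn")]
print(labels)        # [201375, 201400, 201425, 201450]

current = label(2014, "autumn")
window = [x for x in labels if current - 100 <= x <= current]
print(len(window))   # 4: all four seasons fall inside the window
```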
Note: below is the pyspark code:
df.createOrReplaceTempView("df_table")
df1 = spark.sql("""
WITH t1 AS ( SELECT *
, year(requestDate) as YY
, date_format(requestDate, "MMdd") as MMDD
FROM df_table )
, t2 AS ( SELECT *,
CASE
WHEN MMDD BETWEEN '0301' AND '0531' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '03-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '05-31'))
, 'season', 'spring'
, 'label', int(YY)*100
)
WHEN MMDD BETWEEN '0601' AND '0831' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '06-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '08-31'))
, 'season', 'summer'
, 'label', int(YY)*100 + 25
)
WHEN MMDD BETWEEN '0901' AND '1130' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '09-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '11-30'))
, 'season', 'autumn'
, 'label', int(YY)*100 + 50
)
WHEN MMDD BETWEEN '1201' AND '1231' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '12-01'))
, 'endDateOfSeason', last_day(concat_ws('-', int(YY)+1, '02-28'))
, 'season', 'winter'
, 'label', int(YY)*100 + 75
)
WHEN MMDD BETWEEN '0101' AND '0229' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', int(YY)-1, '12-01'))
, 'endDateOfSeason', last_day(concat_ws('-', YY, '02-28'))
, 'season', 'winter'
, 'label', (int(YY)-1)*100 + 75
)
END AS seasons
FROM t1
)
SELECT itemno
, seasons.*
, sum(totalRequested) AS sum_totalRequestedBySeason
FROM t2
GROUP BY itemno, seasons
""")
This gives the following result:
df1.show()
+-----------+-----------------+---------------+------+------+--------------------------+
| itemno|startDateOfSeason|endDateOfSeason|season| label|sum_totalRequestedBySeason|
+-----------+-----------------+---------------+------+------+--------------------------+
|5436134-070| 2013-12-01| 2013-12-31|winter|201375| 8.0|
| 1824299| 2015-03-01| 2015-05-31|spring|201500| 1.0|
| 0453978| 2014-09-01| 2014-11-30|autumn|201450| 18.0|
| 7181584| 2017-01-01| 2017-02-28|winter|201675| 4.0|
| 7178940| 2014-09-01| 2014-11-30|autumn|201450| 5.0|
| 7547385| 2014-01-01| 2014-02-28|winter|201375| 89.0|
| 5814215| 2017-03-01| 2017-05-31|spring|201700| 35.0|
| 3086317| 2014-09-01| 2014-11-30|autumn|201450| 1.0|
| 0450636| 2015-01-01| 2015-02-28|winter|201475| 7.0|
| 2204331| 2015-03-01| 2015-05-31|spring|201500| 2.0|
|5437474-620| 2015-06-01| 2015-08-31|summer|201525| 4.0|
| 5133406| 2014-03-01| 2014-05-31|spring|201400| 46.0|
| 7081417| 2017-03-01| 2017-05-31|spring|201700| 15.0|
| 7519278| 2013-03-01| 2013-05-31|spring|201300| 96.0|
| 7558402| 2014-09-01| 2014-11-30|autumn|201450| 260.0|
| 2204858| 2015-12-01| 2015-12-31|winter|201575| 34.0|
|5437662-070| 2013-06-01| 2013-08-31|summer|201325| 78.0|
| 5334160| 2018-01-01| 2018-02-28|winter|201775| 2.0|
| 3089858| 2014-09-01| 2014-11-30|autumn|201450| 5.0|
| 7512365| 2014-01-01| 2014-02-28|winter|201375| 110.0|
+-----------+-----------------+---------------+------+------+--------------------------+
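The CASE mapping above can be mirrored in plain Python as a sanity check of the season boundaries (a sketch, not the answer's code; season_struct and last_day_feb are hypothetical names). Note how winter spans the year boundary: December starts this year's winter, while January and February belong to the previous year's.

```python
from datetime import date

def last_day_feb(y: int) -> date:
    # 29 in leap years, like Spark's last_day applied to a February date
    return date(y, 2, 29) if (y % 4 == 0 and y % 100 != 0) or y % 400 == 0 else date(y, 2, 28)

# Map a date to (startDateOfSeason, endDateOfSeason, season, label),
# mirroring the named_struct branches of the CASE expression.
def season_struct(d: date):
    y = d.year
    if 3 <= d.month <= 5:
        return date(y, 3, 1), date(y, 5, 31), "spring", y * 100
    if 6 <= d.month <= 8:
        return date(y, 6, 1), date(y, 8, 31), "summer", y * 100 + 25
    if 9 <= d.month <= 11:
        return date(y, 9, 1), date(y, 11, 30), "autumn", y * 100 + 50
    if d.month == 12:
        return date(y, 12, 1), last_day_feb(y + 1), "winter", y * 100 + 75
    return date(y - 1, 12, 1), last_day_feb(y), "winter", (y - 1) * 100 + 75

print(season_struct(date(2014, 1, 15)))
# (datetime.date(2013, 12, 1), datetime.date(2014, 2, 28), 'winter', 201375)
```

A 2014-01-15 request lands in the winter labeled 201375, matching the itemno 7547385 row in the table above.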
After we have the season totals, calculate the sum of the current plus the previous 3 seasons using a Window aggregate function and then the ratio:
df1.selectExpr("*", """
round(sum_totalRequestedBySeason/sum(sum_totalRequestedBySeason) OVER (
PARTITION BY itemno
ORDER BY label
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
),2) AS ratio_of_current_over_current_plus_past_3_seasons
""").show()