星火:集团RDD SQL查询 [英] Spark: Group RDD Sql Query
问题描述
我有3个RDDS,我需要加入。
VAL event1001RDD:schemaRDD = [事件类型,ID,位置,日期1]
[1001,4929102,LOC01,2015-01-20 10时44分39秒]
[1001,4929103,LOC02,2015-01-20 10时44分39秒]
[1001,4929104,LOC03,2015-01-20 10时44分39秒]
VAL event2009RDD:schemaRDD = [事件类型,编号,单元格类型,日期1](因为我视的celltype需要4日期从这个不是ID分组)
[2009,4929101,R01,2015-01-20 20时44分39秒]
[2009,4929102,R02,2015-01-20 14:00:00](RPM)的
[2009,4929102,P01,2015-01-20 12点零零分00秒](PPM)
[2009,4929102,R03,2015-01-20 15:00:00](RPM)的
[2009,4929102,C01,2015-01-20 13:00:00](RPM)的
[2009,4929103,R01,2015-01-20 14时44分39秒]
[2009,4929105,R01,2015-01-20 12时44分39秒]
[2009,4929105,V01,2015-01-20 11点44分39秒]
[2009,4929106,R01,2015-01-20 13时44分39秒]
VAL cellLookupRDD:[单元格类型,单元名称(单元名有4个值)
[R01,RPM]
[R02,RPM]
[R03,RPM]
[C01,RPM]
[P01,PPM]
[V01,PPM]
预期结果:[ID,1001位置,1001的date1,2009的第一个RPM日期,2009年的最后日期RPM,2009年的第一个PPM日期,2009年的最后PPM日期]
4929101,NULL,NULL,2015年1月20日20:44:39,NULL,NULL,NULL
4929102,LOC01,2015-01-20 10:44:39,2015-01-20 13:00:00,2015-01-20 15:00:00,2015-01-20 12:00:00,空
4929103,LOC02,2015-01-20 10:44:39,2015-01-20 14:44:39,NULL,NULL,NULL
4929104,LOC03,2015-01-20 10:44:39,NULL,NULL,NULL,NULL
4929105,NULL,NULL,2015年1月20日12:44:39,NULL,2015年1月20日11:44:39,空
4929106,NULL,NULL,2015年1月20日13:44:39,NULL,NULL,NULL
这是我当前的查询(这里我也表明一个可选的事件类型为第一列;但在我的previous event2009RDD,我选择一个最小和最大的日期是错误的,因为我需要通过确定的四个日期cellLookupRDD - RPM和PPM):
如果选择(event1001Table.eventtype不为空,event1001Table.eventtype,
event2009Table.eventtype)
如果(event1001Table.id不为空,event1001Table.id,
event2009Table.id)
event1001Table.date1,event2009Table.minDate,event2009Table.maxDate
从event1001Table全外连接event2009Table
在event1001Table.id = event2009Table.id)
编辑以显示应用答案后,结果是:
分钟(如果(l.cn ='转'或l.cn ='RPM2',r.date1,'NULL'))为rpmmn,+
MAX(IF(l.cn ='转'或l.cn ='RPM2',r.date1,'NULL'))为rpmmx,+
分钟(如果(l.cn ='PPM'或l.cn ='PPM2',r.date1,'NULL'))为ppmmn,+
MAX(IF(l.cn ='PPM'或l.cn ='PPM2',r.date1,'NULL'))为ppmmx+
[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,C01,2015-01-20 14:00:00] MAX_RPM
---水库:[2009,4929102,NULL,NULL,2015年1月20日13:00:00,2015-01-20 14:00:00]
---正确[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,P01,2015-01-20 14:00:00] min_ppm
---水库:[2009,4929102,2015-01-20 13:00:00,NULL,2015年1月20日14:00:00,NULL]
---不正确(最大值应等于MIN虽然NULL是尽可能pferred $ P $,但我可以只签入code以后如果分=最大值)[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,C01,2015-01-20 14:00:00] MAX_RPM
[2009,4929102,P01,2015-01-20 09:00:00] min_ppm
---资源:[2009,4929102,2015-01-20 13:00:00,NULL,2015年1月20日09:00:00,NULL]
---不正确(最大不工作)
让工作一步一步来。让我们先建立部分2009年
event2009RDD.registerTempTable(base2009)
cellLookupRDD.registerTempTable(查找)trns2009 = ssc.sql(选择事件类型,ID \\
分钟(情况下,当l.cn ='转',那么r.date1否则返回null结束)rpmmn,\\
MAX(情况下,当l.cn ='转',那么r.date1否则返回null结束)rpmmx,\\
分钟(情况下,当l.cn ='PPM',那么r.date1否则返回null结束)ppmmn,\\
MAX(情况下,当l.cn ='PPM',那么r.date1否则返回null结束)ppmmx,\\
从base2009 - [R内部联接查找L于r.celltype = l.celltype \\
按事件类型,IDtrns2009 .registerTempTable(transformed2009)
现在你可以用1001的数据集完全外部联接,并得到输出。
请注意:你不应该有
4929101,NULL,NULL,2015年1月20日20:44:39,NULL,NULL,NULL
相反,你应该有
4929101,NULL,NULL,2015年1月20日20:44:39,2015-01-20 20:44:39,NULL,NULL
由于,如果发生了一次2009年的事件,那么它应该有两个第一和最后日期。 NULL应该重新present事件从未发生过,像ID = 4929101,单元格类型= PPM。
请让我知道,如果这个工程(或没有)。我没有访问到右spark此时刻,但应该能够调试,如果需要的话,今晚。
I have 3 RDDs that I need to join.
val event1001RDD: schemaRDD = [eventtype,id,location,date1]
[1001,4929102,LOC01,2015-01-20 10:44:39]
[1001,4929103,LOC02,2015-01-20 10:44:39]
[1001,4929104,LOC03,2015-01-20 10:44:39]
val event2009RDD: schemaRDD = [eventtype,id,celltype,date1] (not grouped by id since I need 4 dates from this depending on celltype)
[2009,4929101,R01,2015-01-20 20:44:39]
[2009,4929102,R02,2015-01-20 14:00:00] (RPM)
[2009,4929102,P01,2015-01-20 12:00:00] (PPM)
[2009,4929102,R03,2015-01-20 15:00:00] (RPM)
[2009,4929102,C01,2015-01-20 13:00:00] (RPM)
[2009,4929103,R01,2015-01-20 14:44:39]
[2009,4929105,R01,2015-01-20 12:44:39]
[2009,4929105,V01,2015-01-20 11:44:39]
[2009,4929106,R01,2015-01-20 13:44:39]
val cellLookupRDD: [celltype, cellname] (cellname has 4 values)
[R01,RPM]
[R02,RPM]
[R03,RPM]
[C01,RPM]
[P01,PPM]
[V01,PPM]
Expected result: [id,1001's location,1001's date1,2009's first RPM date,2009's last RPM date, 2009's first PPM date,2009's last PPM date]
4929101,NULL,NULL,2015-01-20 20:44:39,NULL,NULL,NULL
4929102,LOC01,2015-01-20 10:44:39,2015-01-20 13:00:00,2015-01-20 15:00:00,2015-01-20 12:00:00,NULL
4929103,LOC02,2015-01-20 10:44:39,2015-01-20 14:44:39,NULL,NULL,NULL
4929104,LOC03,2015-01-20 10:44:39,NULL,NULL,NULL,NULL
4929105,NULL,NULL,2015-01-20 12:44:39,NULL,2015-01-20 11:44:39,NULL
4929106,NULL,NULL,2015-01-20 13:44:39,NULL,NULL,NULL
This is my current query (where I am also indicating an optional eventtype as the first column; but in my previous event2009RDD, I am selecting a min and max date which is wrong since I need the four dates determined via cellLookupRDD - RPM and PPM):
select if(event1001Table.eventtype is not null, event1001Table.eventtype,
event2009Table.eventtype),
if(event1001Table.id is not null, event1001Table.id,
event2009Table.id),
event1001Table.date1, event2009Table.minDate, event2009Table.maxDate
from event1001Table full outer join event2009Table
on event1001Table.id=event2009Table.id")
EDITED to show result after applying answer:
" min(if(l.cn = 'RPM' or l.cn = 'RPM2', r.date1, 'NULL')) as rpmmn, " +
" max(if(l.cn = 'RPM' or l.cn = 'RPM2', r.date1, 'NULL')) as rpmmx, " +
" min(if(l.cn = 'PPM' or l.cn = 'PPM2', r.date1, 'NULL')) as ppmmn, " +
" max(if(l.cn = 'PPM' or l.cn = 'PPM2', r.date1, 'NULL')) as ppmmx " +
[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,C01,2015-01-20 14:00:00] max_rpm
---res: [2009,4929102,NULL,NULL,2015-01-20 13:00:00,2015-01-20 14:00:00]
--- CORRECT
[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,P01,2015-01-20 14:00:00] min_ppm
---res: [2009,4929102,2015-01-20 13:00:00,NULL,2015-01-20 14:00:00,NULL]
--- INCORRECT (max should be equal to MIN although NULL is preferred if possible but I could just check in the code later on if min=max)
[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,C01,2015-01-20 14:00:00] max_rpm
[2009,4929102,P01,2015-01-20 09:00:00] min_ppm
---res: [2009,4929102,2015-01-20 13:00:00,NULL,2015-01-20 09:00:00,NULL]
--- INCORRECT (max is not working)
Lets work it step by step. Lets first construct the 2009 part
event2009RDD.registerTempTable("base2009")
cellLookupRDD.registerTempTable("lookup")
trns2009 = ssc.sql("select eventtype, id, \
min(case when l.cn = 'RPM' then r.date1 else null end) rpmmn, \
max(case when l.cn = 'RPM' then r.date1 else null end) rpmmx, \
min(case when l.cn = 'PPM' then r.date1 else null end) ppmmn, \
max(case when l.cn = 'PPM' then r.date1 else null end) ppmmx, \
from base2009 r inner join lookup l on r.celltype=l.celltype \
group by eventtype,id "
trns2009 .registerTempTable("transformed2009")
Now you can do a full outer join with 1001 data set and get output.
Note: you should not have
4929101,NULL,NULL,2015-01-20 20:44:39,NULL,NULL,NULL
instead, you should have
4929101,NULL,NULL,2015-01-20 20:44:39,2015-01-20 20:44:39,NULL,NULL
Because, if 2009 event have occurred once, then it should have both first and last date. NULL should represent an event never occurred, like for id=4929101, celltype=PPM.
Please let me know if this works (or not). I do not have access to spark right this moment, but should be able to debug, if needed, tonight.
这篇关于星火:集团RDD SQL查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!