Spark: Group RDD SQL Query

Question

I have 3 RDDs that I need to join.

val event1001RDD: schemaRDD = [eventtype,id,location,date1]

[1001,4929102,LOC01,2015-01-20 10:44:39]
[1001,4929103,LOC02,2015-01-20 10:44:39]
[1001,4929104,LOC03,2015-01-20 10:44:39]

val event2009RDD: schemaRDD = [eventtype,id,celltype,date1] (not grouped by id since I need 4 dates from this depending on celltype)

[2009,4929101,R01,2015-01-20 20:44:39]
[2009,4929102,R02,2015-01-20 14:00:00] (RPM)
[2009,4929102,P01,2015-01-20 12:00:00] (PPM)
[2009,4929102,R03,2015-01-20 15:00:00] (RPM)
[2009,4929102,C01,2015-01-20 13:00:00] (RPM)
[2009,4929103,R01,2015-01-20 14:44:39]
[2009,4929105,R01,2015-01-20 12:44:39]
[2009,4929105,V01,2015-01-20 11:44:39]
[2009,4929106,R01,2015-01-20 13:44:39]

val cellLookupRDD: [celltype, cellname] (cellname has 4 values)

[R01,RPM]
[R02,RPM]
[R03,RPM]
[C01,RPM]
[P01,PPM]
[V01,PPM]

Expected result: [id,1001's location,1001's date1,2009's first RPM date,2009's last RPM date, 2009's first PPM date,2009's last PPM date]

4929101,NULL,NULL,2015-01-20 20:44:39,NULL,NULL,NULL
4929102,LOC01,2015-01-20 10:44:39,2015-01-20 13:00:00,2015-01-20 15:00:00,2015-01-20 12:00:00,NULL
4929103,LOC02,2015-01-20 10:44:39,2015-01-20 14:44:39,NULL,NULL,NULL
4929104,LOC03,2015-01-20 10:44:39,NULL,NULL,NULL,NULL
4929105,NULL,NULL,2015-01-20 12:44:39,NULL,2015-01-20 11:44:39,NULL
4929106,NULL,NULL,2015-01-20 13:44:39,NULL,NULL,NULL
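The expected table above can be reproduced with a small Spark-free Python sketch (variable names are mine, not from the question). It follows the table's convention that the "last" date is NULL when only one event of that cellname exists:

```python
# Sample data from the question (eventtype column dropped for brevity).
event1001 = {  # id -> (location, date1)
    4929102: ("LOC01", "2015-01-20 10:44:39"),
    4929103: ("LOC02", "2015-01-20 10:44:39"),
    4929104: ("LOC03", "2015-01-20 10:44:39"),
}
event2009 = [  # (id, celltype, date1)
    (4929101, "R01", "2015-01-20 20:44:39"),
    (4929102, "R02", "2015-01-20 14:00:00"),
    (4929102, "P01", "2015-01-20 12:00:00"),
    (4929102, "R03", "2015-01-20 15:00:00"),
    (4929102, "C01", "2015-01-20 13:00:00"),
    (4929103, "R01", "2015-01-20 14:44:39"),
    (4929105, "R01", "2015-01-20 12:44:39"),
    (4929105, "V01", "2015-01-20 11:44:39"),
    (4929106, "R01", "2015-01-20 13:44:39"),
]
cell_lookup = {"R01": "RPM", "R02": "RPM", "R03": "RPM",
               "C01": "RPM", "P01": "PPM", "V01": "PPM"}

def first_last(dates):
    """First date, and last date only if more than one event exists."""
    if not dates:
        return (None, None)
    dates = sorted(dates)
    return (dates[0], dates[-1] if len(dates) > 1 else None)

# Full-outer union of ids, then conditional first/last per cellname.
result = {}
ids = set(event1001) | {i for i, _, _ in event2009}
for i in sorted(ids):
    loc, d1 = event1001.get(i, (None, None))
    rpm = [d for j, ct, d in event2009 if j == i and cell_lookup[ct] == "RPM"]
    ppm = [d for j, ct, d in event2009 if j == i and cell_lookup[ct] == "PPM"]
    result[i] = (loc, d1) + first_last(rpm) + first_last(ppm)
```

Running this yields exactly the six expected rows, e.g. `result[4929102]` is `("LOC01", "2015-01-20 10:44:39", "2015-01-20 13:00:00", "2015-01-20 15:00:00", "2015-01-20 12:00:00", None)`.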

This is my current query (where I also emit an optional eventtype as the first column). In it I select a single min and max date from event2009RDD, which is wrong: I need the four dates (first/last RPM and first/last PPM) determined via cellLookupRDD.

select if(event1001Table.eventtype is not null, event1001Table.eventtype,
          event2009Table.eventtype),
       if(event1001Table.id is not null, event1001Table.id,
          event2009Table.id),
       event1001Table.date1, event2009Table.minDate, event2009Table.maxDate
from event1001Table full outer join event2009Table
  on event1001Table.id = event2009Table.id


EDITED to show result after applying the answer:
  " min(if(l.cn = 'RPM' or l.cn = 'RPM2', r.date1, 'NULL')) as rpmmn, " +
  " max(if(l.cn = 'RPM' or l.cn = 'RPM2', r.date1, 'NULL')) as rpmmx, " +
  " min(if(l.cn = 'PPM' or l.cn = 'PPM2', r.date1, 'NULL')) as ppmmn, " +
  " max(if(l.cn = 'PPM' or l.cn = 'PPM2', r.date1, 'NULL')) as ppmmx " +


[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,C01,2015-01-20 14:00:00] max_rpm
---res: [2009,4929102,NULL,NULL,2015-01-20 13:00:00,2015-01-20 14:00:00] 
--- CORRECT

[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,P01,2015-01-20 14:00:00] min_ppm
---res: [2009,4929102,2015-01-20 13:00:00,NULL,2015-01-20 14:00:00,NULL] 
--- INCORRECT (max should equal min here; NULL would be preferred if possible, but I could just check later in code whether min = max)

[2009,4929102,R01,2015-01-20 13:00:00] min_rpm
[2009,4929102,C01,2015-01-20 14:00:00] max_rpm
[2009,4929102,P01,2015-01-20 09:00:00] min_ppm
---res: [2009,4929102,2015-01-20 13:00:00,NULL,2015-01-20 09:00:00,NULL] 
--- INCORRECT (max is not working)
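The max failures above are consistent with the edited query using the string literal `'NULL'` instead of SQL null: the dates are strings, strings compare lexicographically, and `'NULL'` sorts after any `'2015-…'` date (`'N'` > `'2'` in ASCII). So `max()` returns the literal whenever any non-matching row exists in the group, while `min()` still appears to work. A one-line check of the same comparison in plain Python:

```python
# Lexicographic comparison: 'N' (0x4E) sorts after '2' (0x32),
# so the literal 'NULL' wins any max() against a date string.
dates = ["2015-01-20 13:00:00", "NULL", "2015-01-20 15:00:00"]
assert max(dates) == "NULL"                  # the bug: max picks the literal
assert min(dates) == "2015-01-20 13:00:00"   # min still looks correct
```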

Answer

Let's work it step by step. Let's first construct the 2009 part.

event2009RDD.registerTempTable("base2009")
cellLookupRDD.registerTempTable("lookup")

val trns2009 = ssc.sql("""
  select r.eventtype, r.id,
         min(case when l.cellname = 'RPM' then r.date1 else null end) as rpmmn,
         max(case when l.cellname = 'RPM' then r.date1 else null end) as rpmmx,
         min(case when l.cellname = 'PPM' then r.date1 else null end) as ppmmn,
         max(case when l.cellname = 'PPM' then r.date1 else null end) as ppmmx
  from base2009 r inner join lookup l on r.celltype = l.celltype
  group by r.eventtype, r.id""")

trns2009.registerTempTable("transformed2009")
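Outside Spark, the conditional min/max aggregation in that query can be sanity-checked with plain Python over the question's 2009 sample (a sketch, not the Spark code; `default=None` plays the role of the all-null group):

```python
from collections import defaultdict

event2009 = [  # (id, celltype, date1) from the question's sample
    (4929101, "R01", "2015-01-20 20:44:39"),
    (4929102, "R02", "2015-01-20 14:00:00"),
    (4929102, "P01", "2015-01-20 12:00:00"),
    (4929102, "R03", "2015-01-20 15:00:00"),
    (4929102, "C01", "2015-01-20 13:00:00"),
    (4929103, "R01", "2015-01-20 14:44:39"),
]
lookup = {"R01": "RPM", "R02": "RPM", "R03": "RPM",
          "C01": "RPM", "P01": "PPM", "V01": "PPM"}

# Group by id, then min/max per cellname -- mirrors the CASE WHEN aggregation.
groups = defaultdict(lambda: {"RPM": [], "PPM": []})
for i, ct, d in event2009:
    groups[i][lookup[ct]].append(d)

transformed = {
    i: (min(g["RPM"], default=None), max(g["RPM"], default=None),
        min(g["PPM"], default=None), max(g["PPM"], default=None))
    for i, g in groups.items()
}
```

Note that an id with a single event of some cellname gets min = max for that pair, e.g. `transformed[4929101]` has the same first and last RPM date, which is the behavior the note below argues for.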

Now you can do a full outer join with the 1001 data set and get the output.
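The full-outer-join semantics of that last step can be sketched in plain Python (hypothetical dict shapes, not Spark): take the union of ids and fill whichever side is missing with nulls.

```python
# Full outer join by id, sketched with plain dicts.
event1001 = {  # id -> (location, date1)
    4929104: ("LOC03", "2015-01-20 10:44:39"),
}
transformed2009 = {  # id -> (rpmmn, rpmmx, ppmmn, ppmmx)
    4929101: ("2015-01-20 20:44:39", "2015-01-20 20:44:39", None, None),
}
joined = {
    i: event1001.get(i, (None, None)) + transformed2009.get(i, (None,) * 4)
    for i in set(event1001) | set(transformed2009)
}
```

An id present on only one side (like 4929104 or 4929101 here) still produces a row, with NULLs for the missing side, which is exactly what the full outer join gives you in SQL.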

Note: you should not get

4929101,NULL,NULL,2015-01-20 20:44:39,NULL,NULL,NULL

Instead, you should get

4929101,NULL,NULL,2015-01-20 20:44:39,2015-01-20 20:44:39,NULL,NULL

because if a 2009 event occurred even once, it has both a first and a last date (they are equal). NULL should represent an event that never occurred, as for id=4929101 with cellname=PPM.

Please let me know if this works (or not). I don't have access to Spark right this moment, but should be able to debug tonight if needed.
