Temporal加入Hive查询(时间紧邻的事件) [英] Temporal Join in Hive query (events in close proximity in time)
问题描述
我有一个如下所示的时间序列:
时间源word1 word2 ...等
2012-02-01 23:43:16.9988243 0001 2B3B FAF0
2012- 02-01 23:43:16.9993561 0002 2326 ABAA
2012-02-01 23:43:16.9998879 0002 2327 ABAA
我需要一个查询,如果一个 source
中的记录满足特定条件,除了该记录外,还应该返回一个或多个记录与第二个源
在时间上接近以满足不同的条件集。
我迄今为止的尝试如下所示:
SELECT time
FROM消息C
JOIN消息D在
D.time
之间C.time - INTERVAL'0.001'SECOND
和C.time + INTERVAL'0.001'SE COND
WHERE C.source ='0001'
AND D.Source ='0002'
AND C.word1 ='2B3B'
AND D.word2 = 'ABAA'
应该返回上面示例数据中的第一条和第二条记录(它不应该返回第三个,因为时间比.001秒更远)。
但是查询不起作用。错误消息是:
FAILED:SemanticException'0.001'遇到0个孩子
这是一个天真的解决方案:
<$ p $从消息中选择*
c
交叉连接消息m
其中m.time在c.time - interval'0.001'之间第二个
和c.time +间隔'0.001'秒
和c.word1 ='2B3B'
和m.word2 ='ABAA'
;
+ ---------------------------- + -------- + ------- + ---- --- + ---------------------------- + -------- + ------- + ------- +
|时间|来源| word1 | word2 |时间|来源| word1 | word2 |
+ ---------------------------- + -------- + ------- + ------- + ---------------------------- + -------- + --- ---- + ------- +
| 2012-02-01 23:43:16.998824 | 0001 | 2B3B | FAF0 | 2012-02-01 23:43:16.999356 | 0002 | 2326 | ABAA |
+ ---------------------------- + -------- + ------- + ------- + ---------------------------- + -------- + --- ---- + ------- +
这是性能良好的解决方案。
select *
from messages c
在楼上加入消息m
(cast(c.time为小数点(37,7))/(2 * 0.001))=
floor(cast(m .time作为十进制(37,7))/(2 * 0.001))
其中m.time在c.time - interval'0.001'second
和c.time + 0.001'秒
和c.word1 ='2B3B'
和m.word2 ='ABAA'
联合所有
选择*
来自消息c
加入消息m
在楼层((cast(c.time如小数点(37,7))+ 0.001 )/(2 * 0.001))=
floor((cas t(m.time为十进制(37,7))+ 0.001)/(2 * 0.001))
where floor(cast(c.time as decimal(37,7))/(2 * 0.001))<>
floor(演员(m.time为小数点(37,7))/(2 * 0.001))
和m.time之间c.time - 间隔'0.001'秒
和c.time +间隔'0.001'秒
和c.word1 ='2B3B'
和m.word2 ='ABAA'
+ ---------- ------------------ + -------- + ------- ------- + ------ + ---------------------- + ------- + ------- + ------- +
|时间|来源| word1 | word2 | _col4 | _col5 | _col6 | _col7 |
+ ---------------------------- + -------- + ------- + ------- + ---------------------------- + ------- + ---- --- + ------- +
| 2012-02-01 23:43:16.998824 | 0001 | 2B3B | FAF0 | 2012-02-01 23:43:16.999356 | 0002 | 2326 | ABAA |
+ ---------------------------- + -------- + ------- + ------- + ---------------------------- + ------- + ---- --- + ------- +
插图
活动A和B将被UNION ALL的上部抓住。
活动B和C将被UNION ALL的下部抓住。
0 0.002 0.004 0.006 0.008 0.01
| | | | | |
---------------------------------------------- ---------
| |
| |
A B C
| |
| |
---------------------------------------------- ---------
| | | | |
0.001 0.003 0.005 0.007 0.009
I have a need for a hive query that I'm having difficulty figuring out.
I have a time series that looks like this:
time source word1 word2 ...etc
2012-02-01 23:43:16.9988243 0001 2B3B FAF0
2012-02-01 23:43:16.9993561 0002 2326 ABAA
2012-02-01 23:43:16.9998879 0002 2327 ABAA
I need a query such that, if a record in one source
fulfills specific conditions, in addition to that record, it should also return one or more records in close proximity in time from the second source
fulfilling a different set of conditions.
My attempt so far looks like this:
SELECT time
FROM messages C
JOIN messages D on
D.time
BETWEEN C.time - INTERVAL '0.001' SECOND
AND C.time + INTERVAL '0.001' SECOND
WHERE C.source = '0001'
AND D.Source = '0002'
AND C.word1 = '2B3B'
AND D.word2 = 'ABAA'
Which should return the first and second records in the sample data above (It shouldn't return the third, because the time is farther away than .001 seconds).
But the query doesn't work. The error message is
FAILED: SemanticException '0.001' encountered with 0 children
This would be a naive solution:
select *
from messages c
cross join messages m
where m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
;
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
| time | source | word1 | word2 | time | source | word1 | word2 |
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
| 2012-02-01 23:43:16.998824 | 0001 | 2B3B | FAF0 | 2012-02-01 23:43:16.999356 | 0002 | 2326 | ABAA |
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
This is the solution with the good performance
select *
from messages c
join messages m
on floor (cast(c.time as decimal(37,7)) / (2 * 0.001)) =
floor (cast(m.time as decimal(37,7)) / (2 * 0.001))
where m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
union all
select *
from messages c
join messages m
on floor ((cast(c.time as decimal(37,7)) + 0.001) / (2 * 0.001)) =
floor ((cast(m.time as decimal(37,7)) + 0.001) / (2 * 0.001))
where floor (cast(c.time as decimal(37,7)) / (2 * 0.001)) <>
floor (cast(m.time as decimal(37,7)) / (2 * 0.001))
and m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
| time | source | word1 | word2 | _col4 | _col5 | _col6 | _col7 |
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
| 2012-02-01 23:43:16.998824 | 0001 | 2B3B | FAF0 | 2012-02-01 23:43:16.999356 | 0002 | 2326 | ABAA |
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
Illustration
Events A and B are going to be caught by the upper part of the UNION ALL.
Events B and C are going to be caught by the lower part of the UNION ALL.
0 0.002 0.004 0.006 0.008 0.01
| | | | | |
-------------------------------------------------------
| |
| |
A B C
| |
| |
-------------------------------------------------------
| | | | |
0.001 0.003 0.005 0.007 0.009
这篇关于Temporal加入Hive查询(时间紧邻的事件)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!