Hive 查询中的临时连接(时间上很接近的事件) [英] Temporal Join in Hive query (events in close proximity in time)
问题描述
我需要一个难以弄清楚的 hive 查询.
I have a need for a hive query that I'm having difficulty figuring out.
我有一个如下所示的时间序列:
I have a time series that looks like this:
time source word1 word2 ...etc
2012-02-01 23:43:16.9988243 0001 2B3B FAF0
2012-02-01 23:43:16.9993561 0002 2326 ABAA
2012-02-01 23:43:16.9998879 0002 2327 ABAA
我需要这样一个查询,如果一个source
中的记录满足特定条件,除了该记录外,它还应该返回一个或多个时间接近的记录 来自第二个 source
满足一组不同的条件.
I need a query such that, if a record in one source
fulfills specific conditions, in addition to that record, it should also return one or more records in close proximity in time from the second source
fulfilling a different set of conditions.
到目前为止我的尝试是这样的:
My attempt so far looks like this:
SELECT time
FROM messages C
JOIN messages D on
D.time
BETWEEN C.time - INTERVAL '0.001' SECOND
AND C.time + INTERVAL '0.001' SECOND
WHERE C.source = '0001'
AND D.Source = '0002'
AND C.word1 = '2B3B'
AND D.word2 = 'ABAA'
哪个应该返回上面示例数据中的第一条和第二条记录(它不应该返回第三条,因为时间比 0.001 秒更远).
Which should return the first and second records in the sample data above (It shouldn't return the third, because the time is farther away than .001 seconds).
但是查询不起作用.错误信息是
But the query doesn't work. The error message is
FAILED: SemanticException '0.001' encountered with 0 children
推荐答案
这将是一个天真的解决方案:
This would be a naive solution:
select *
from messages c
cross join messages m
where m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
;
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
| time | source | word1 | word2 | time | source | word1 | word2 |
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
| 2012-02-01 23:43:16.998824 | 0001 | 2B3B | FAF0 | 2012-02-01 23:43:16.999356 | 0002 | 2326 | ABAA |
+----------------------------+--------+-------+-------+----------------------------+--------+-------+-------+
这是性能良好的解决方案
This is the solution with the good performance
select *
from messages c
join messages m
on floor (cast(c.time as decimal(37,7)) / (2 * 0.001)) =
floor (cast(m.time as decimal(37,7)) / (2 * 0.001))
where m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
union all
select *
from messages c
join messages m
on floor ((cast(c.time as decimal(37,7)) + 0.001) / (2 * 0.001)) =
floor ((cast(m.time as decimal(37,7)) + 0.001) / (2 * 0.001))
where floor (cast(c.time as decimal(37,7)) / (2 * 0.001)) <>
floor (cast(m.time as decimal(37,7)) / (2 * 0.001))
and m.time between c.time - interval '0.001' second
and c.time + interval '0.001' second
and c.word1 = '2B3B'
and m.word2 = 'ABAA'
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
| time | source | word1 | word2 | _col4 | _col5 | _col6 | _col7 |
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
| 2012-02-01 23:43:16.998824 | 0001 | 2B3B | FAF0 | 2012-02-01 23:43:16.999356 | 0002 | 2326 | ABAA |
+----------------------------+--------+-------+-------+----------------------------+-------+-------+-------+
插图
事件 A 和 B 将被 UNION ALL 的上半部分捕获.
事件 B 和 C 将被 UNION ALL 的下半部分捕获.
Illustration
Events A and B are going to be caught by the upper part of the UNION ALL.
Events B and C are going to be caught by the lower part of the UNION ALL.
0 0.002 0.004 0.006 0.008 0.01
| | | | | |
-------------------------------------------------------
| |
| |
A B C
| |
| |
-------------------------------------------------------
| | | | |
0.001 0.003 0.005 0.007 0.009
这篇关于Hive 查询中的临时连接(时间上很接近的事件)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!