Hive时间窗口函数的bug [英] Hive time window function's bug
问题描述
我有一个名为gmv_active_mem_monthly的表。整个行可以在这里看到:
I have a table named gmv_active_mem_monthly. The whole rows can been seen here:
month gmv_monthly active_member_monthly
201612 231657626042 2602064
201611 373576915733 3498039
201610 367824193757 3648708
201609 356167649082 3686007
201608 383362147243 3998595
201607 383828659139 3917252
201606 332929299345 3627298
201605 323084120955 3579938
201604 280834688208 3293682
201603 282180201106 3316420
201602 246386923468 3097107
201601 261355415707 3186347
201512 273860930491 3071105
201511 246606316046 2981534
201510 237766306308 2873558
201509 160390583711 2267418
201508 124370765573 2002018
201507 110236706032 1855539
201506 84844225170 1467889
201505 60651906632 1180800
201504 46808796126 917681
201503 12498656329 427529
201502 4918371362 190932
201501 2824293727 129203
I在配置单元中运行一个简单的代码:
I run a simple code in hive:
select month,
sum(gmv_monthly) over
(
order by "month"
rows between 12 preceding and 1 preceding
) as total_gmv,
sum(active_member_monthly) over
(
order by "month"
rows between 12 preceding and 1 preceding
) as total_active_mem
from novaya.gmv_active_mem_monthly
;
但是结果是错误的,而我在另一个数据集上使用相同的代码是正确的。
上面数据集的结果是:
But the result is total wrong while I use the same code on another dataset it is right. The result on the dataset above is :
month total_gmv total_active_mem
201501 NULL NULL
201502 2824293727 129203
201503 7742665089 320135
201504 20241321418 747664
201505 67050117544 1665345
201506 127702024176 2846145
201507 212546249346 4314034
201508 322782955378 6169573
201509 447153720951 8171591
201510 607544304662 10439009
201511 845310610970 13312567
201512 1091916927016 16294101
201601 1365777857507 19365206
201602 1624308979487 22422350
201603 1865777531593 25328525
201604 2135459076370 28217416
201605 2369484968452 30593417
201606 2631917182775 32992555
201607 2880002256950 35151964
201608 3153594210057 37213677
201609 3412585591727 39210254
201610 3608362657098 40628843
201611 3738420544547 41403993
201612 3865391144234 41920498
我们可以检查201602年的1624308979487减去201601年的1365777857507并不等于2016年的gmv_active_mem_monthly值。
那么代码有什么问题?代码在另一个数据集上运行完美,没有像这样的错误。
We can check that 1624308979487 from 201602 minus 1365777857507 from 201601 is not equal to 201601's value in gmv_active_mem_monthly. So what's wrong with the code? The code runs perfect on another dataset without error like this.
推荐答案
没有问题。结果是正确的。
与1个月不同,它与 2个月,即每个范围的边缘之一相关。
There is no problem. The results are correct.
The difference is not not with 1 month, it is with 2 months, one of each edge of the range.
201502 4,918,371,362 <-- This value goes only with 201601
201503 12,498,656,329
201504 46,808,796,126
201505 60,651,906,632
201506 84,844,225,170
201507 110,236,706,032
201508 124,370,765,573
201509 160,390,583,711
201510 237,766,306,308
201511 246,606,316,046
201512 273,860,930,491
201601 261,355,415,707 <-- This value goes only with 201602
201602
这篇关于Hive时间窗口函数的bug的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!