可以并行化每个BY [英] Parallelizable OVER EACH BY

查看:74
本文介绍了可以并行化每个BY的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一次又一次地碰到了这个障碍......

>



是否有最佳做法或建议如何使用窗口函数(Over())与超大数据集无法在单个节点上处理?



碎片化我的数据并使用不同的过滤器运行相同的查询可以工作,但它非常有限,花费大量时间(和手工劳动)并且代价高昂(对同一数据运行相同的查询设置30次,而不是一次)。



提到Jeremy的答案:
更好,但仍然无法正常工作。
如果我拿出我的原始查询样本:

  select title,count(contributor_id<> LeadContributor then 1的情况
count(当contributor_id = LeadContributor,然后1 else null结束时),
count(*)作为总和
from

作为LeadContributor
FROM [publicdata:samples.wikipedia]
其中regexp_match(title,r'^ [A,B])是SELECT title,contributor_id,lead(contributor_id)over(根据标题顺序由时间戳分区) ')= true

按标题分组

现在可以运行...
但是

  select title,count(contributor_id<> LeadContributor then 1 else null end时的情况) 
count(contributor_id = LeadContributor则1则其他null结束)的情况相同,
作为总数


)计数(*)SELECT title,contributor_id,lead (contributor_id)over(按标题顺序按时间戳分区)作为LeadContributor
FROM [publicdata:samples.wikipedia ]
其中regexp_match(标题,r'^ [AZ]')=真

每个标题组

再次提供Resources Exceeded Error ...

解决方案

根据OVER中给出的PARTITION BY子句以分布式方式执行。如果你为你的窗口函数提供PARTITION BY,你的数据将被并行处理,类似于JOIN EACH和GROUP EACH BY的处理方式。

另外,你可以在JOIN EACH或GROUP EACH BY的输出上使用PARTITION BY,而无需序列化执行。使用与JOIN EACH或GROUP EACH BY相同的PARTITION BY键是非常有效的,因为数据不需要在加入/聚合和窗口函数执行之间重新组合。


I am hitting this obstacle again and again...

JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions

Is there a best practice or recommendations how to use window functions (Over()) with very large data sets that cannot be processed on a single node?

Fragmenting my data and running the same query with different filters can work, but its very limiting, takes lot of time (and manual labor) and costly (running same query on the same data set 30 times instead of once).

Referring to Jeremy's answer bellow... It's better, but still doesn't work properly. If I take my original query sample:

select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor  
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A,B]')=true
)
group by title

Now works... But

select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor  
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A-Z]')=true
)
group each by title

Gives again the Resources Exceeded Error...

解决方案

Window functions can now be executed in distributed fashion according to the PARTITION BY clause given inside OVER. If you supply a PARTITION BY with your window functions, your data will be processed in parallel similar to how JOIN EACH and GROUP EACH BY are processed.

In addition, you can use PARTITION BY on the output of JOIN EACH or GROUP EACH BY without serializing execution. Using the same keys for PARTITION BY as for JOIN EACH or GROUP EACH BY is particularly efficient, because the data will not need to be reshuffled between join/aggregation and window function execution.

这篇关于可以并行化每个BY的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆