可以并行化每个BY [英] Parallelizable OVER EACH BY
问题描述
是否有最佳做法或建议如何使用窗口函数(Over())与超大数据集无法在单个节点上处理?
碎片化我的数据并使用不同的过滤器运行相同的查询可以工作,但它非常有限,花费大量时间(和手工劳动)并且代价高昂(对同一数据运行相同的查询设置30次,而不是一次)。
提到Jeremy的答案:
更好,但仍然无法正常工作。
如果我拿出我的原始查询样本:
select title,count(contributor_id<> LeadContributor then 1的情况
count(当contributor_id = LeadContributor,然后1 else null结束时),
count(*)作为总和
from
(
作为LeadContributor
FROM [publicdata:samples.wikipedia]
其中regexp_match(title,r'^ [A,B])是SELECT title,contributor_id,lead(contributor_id)over(根据标题顺序由时间戳分区) ')= true
)
按标题分组
现在可以运行...
但是
select title,count(contributor_id<> LeadContributor then 1 else null end时的情况)
count(contributor_id = LeadContributor则1则其他null结束)的情况相同,
作为总数
从
(
)计数(*)SELECT title,contributor_id,lead (contributor_id)over(按标题顺序按时间戳分区)作为LeadContributor
FROM [publicdata:samples.wikipedia ]
其中regexp_match(标题,r'^ [AZ]')=真
)
每个标题组
再次提供Resources Exceeded Error ...
根据OVER中给出的PARTITION BY子句以分布式方式执行。如果你为你的窗口函数提供PARTITION BY,你的数据将被并行处理,类似于JOIN EACH和GROUP EACH BY的处理方式。
另外,你可以在JOIN EACH或GROUP EACH BY的输出上使用PARTITION BY,而无需序列化执行。使用与JOIN EACH或GROUP EACH BY相同的PARTITION BY键是非常有效的,因为数据不需要在加入/聚合和窗口函数执行之间重新组合。I am hitting this obstacle again and again...
JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions
Is there a best practice or recommendations how to use window functions (Over()) with very large data sets that cannot be processed on a single node?
Fragmenting my data and running the same query with different filters can work, but its very limiting, takes lot of time (and manual labor) and costly (running same query on the same data set 30 times instead of once).
Referring to Jeremy's answer bellow... It's better, but still doesn't work properly. If I take my original query sample:
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A,B]')=true
)
group by title
Now works... But
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A-Z]')=true
)
group each by title
Gives again the Resources Exceeded Error...
Window functions can now be executed in distributed fashion according to the PARTITION BY clause given inside OVER. If you supply a PARTITION BY with your window functions, your data will be processed in parallel similar to how JOIN EACH and GROUP EACH BY are processed.
In addition, you can use PARTITION BY on the output of JOIN EACH or GROUP EACH BY without serializing execution. Using the same keys for PARTITION BY as for JOIN EACH or GROUP EACH BY is particularly efficient, because the data will not need to be reshuffled between join/aggregation and window function execution.
这篇关于可以并行化每个BY的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!