查询超时6小时后,如何优化? [英] Query times out after 6 hours, how to optimize it?

查看:54
本文介绍了查询超时6小时后,如何优化?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个表,分别是 shapes squares ,这些表是根据 GEOGRAHPY 列的交集加入的.

I have two tables, shapes and squares, that I'm joining based on intersections of GEOGRAHPY columns.

shapes 表包含车辆的行驶路线:

The shapes table contains travel routes for vehicles:

shape_key        STRING            identifier for the shape
shape_lines      ARRAY<GEOGRAPHY>  consecutive line segments making up the shape
shape_geography  GEOGRAPHY         the union of all shape_lines
shape_length_km  FLOAT64           length of the shape in kilometers

Rows: 65k
Size: 718 MB

我们将 shape_lines 分隔在 ARRAY 中,因为形状有时会自身重返,我们希望将这些线段分隔开,而不是

We keep shape_lines separated out in an ARRAY because shapes sometimes double back on themselves, and we want to keep those line segments separate instead of deduplicating them.

squares 表包含1×1 km正方形的网格:

The squares table contains a grid of 1×1 km squares:

square_key        INT64      identifier of the grid square
square_geography  GEOGRAPHY  four-cornered polygon describing the grid square

Rows: 102k
Size: 15 MB

这些形状代表车辆的行驶路线.对于每种形状,我们都在单独的表格中计算了有害物质的排放量.目的是计算每个网格平方的排放量,前提是假设这些排放量沿路线均匀分布.为此,我们需要知道路线形状的哪些部分与每个网格单元相交.

The shapes represent travel routes for vehicles. For each shape, we have computed emissions of harmful substances in a separate table. The aim is to calculate the emissions per grid square, assuming that they are evenly distributed along the route. To that end, we need to know what portion of the route shape intersects with each grid cell.

以下是用于计算该查询的查询:

Here's the query to compute that:

SELECT
  shape_key,
  square_key,
  SAFE_DIVIDE(
      (
        SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
        FROM UNNEST(shape_lines) AS line
      ),
      shape_length_km)
    AS square_portion
FROM
  shapes,
  squares
WHERE
  ST_INTERSECTS(shape_geography, square_geography)

遗憾的是,此查询在6小时后超时,而没有产生有用的结果.

Sadly, this query times out after 6 hours instead of producing a useful result.

在最坏的情况下,查询可以产生66亿行,但实际上不会发生.我估计每个形状通常相交可能是50个网格正方形,因此输出应为65k * 50 = 3.3M行;BigQuery不应该处理的所有内容.

In the worst case, the query can produce 6.6 billion rows, but that will not happen in practice. I estimate that each shape typically intersects maybe 50 grid squares, so the output should be around 65k * 50 = 3.3M rows; nothing that BigQuery shouldn't be able to handle.

我考虑了BigQuery执行的地理连接优化:

I have considered the geographic join optimizations performed by BigQuery:

空间联接是 WHERE 子句中具有谓词地理功能的两个表的联接.

Spatial JOINs are joins of two tables with a predicate geographic function in the WHERE clause.

检查.我什至将我的 INNER JOIN 重写为上面所示的等效逗号"联接.

Check. I even rewrote my INNER JOIN to the equivalent "comma" join shown above.

当您保留地理数据时,空间连接的性能会更好.

Spatial joins perform better when your geography data is persisted.

检查. shape_geography square_geography 都直接来自现有表格.

Check. Both shape_geography and square_geography come straight from existing tables.

BigQuery使用以下标准SQL谓词函数为INNER JOIN和CROSS JOIN运算符实现了优化的空间JOIN:[...] ST_Intersects

检查.只需一个 ST_Intersect 调用,没有其他条件.

Check. Just a single ST_Intersect call, no other conditions.

空间连接未优化:对于LEFT,RIGHT或FULL OUTER连接;如果涉及到ANTI联接;当空间谓词取反时.

Spatial joins are not optimized: for LEFT, RIGHT or FULL OUTER joins; in cases involving ANTI joins; when the spatial predicate is negated.

检查.这些情况均不适用.

Check. None of these cases apply.

因此,我认为BigQuery应该能够使用其使用的任何空间索引数据结构来优化此联接.

So I think BigQuery should be able to optimize this join using whatever spatial indexing data structures it uses.

我还考虑了有关交叉连接的建议:

避免连接产生的输出多于输入的连接.

Avoid joins that generate more outputs than inputs.

此查询肯定产生比输入更多的输出;这是它的本质,无法避免.

This query definitely generates more outputs than inputs; that's in its nature and cannot be avoided.

当需要 CROSS JOIN 时,请预先汇总您的数据.

When a CROSS JOIN is required, pre-aggregate your data.

为避免与产生比输入更多的输出的联接相关的性能问题:

To avoid performance issues associated with joins that generate more outputs than inputs:

  • 使用GROUP BY子句预聚合数据.

检查.我已经预先汇总了按形状分组的排放数据,以便 shapes 表中的每个形状都是唯一且不同的.

Check. I already pre-aggregated the emissions data grouped by shapes, so that each shape in the shapes table is unique and distinct.

  • 使用窗口功能.窗口功能通常比使用交叉联接更有效.有关更多信息,请参见分析函数.

我认为无法在此查询中使用窗口函数.

I don't think it's possible to use a window function for this query.

我怀疑BigQuery会根据输入行数而不是中间表或输出的大小来分配资源.那可以解释我所看到的病理行为.

I suspect that BigQuery allocates resources based on the number of input rows, not on the size of the intermediate tables or output. That would explain the pathological behaviour I'm seeing.

如何使此查询在合理的时间内运行?

How can I make this query run in reasonable time?

推荐答案

我认为 squares 被倒置,导致地球多边形几乎满了.

I think the squares got inverted, resulting in almost-full Earth polygons:

select st_area(square_geography), * from   `open-transport-data.public.squares`

打印结果,如 5.1E14 -完整的地球仪区域.因此,任何一条线都几乎与所有正方形相交.有关详细信息,请参见BigQuery文档: https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation

Prints results like 5.1E14 - which is full globe area. So any line intersects almost all the squares. See BigQuery doc for details : https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation

您可以通过运行 ST_GeogFromText(wkt,FALSE)反转它们-选择较小的多边形,而忽略多边形的方向,这相当快:

You can invert them by running ST_GeogFromText(wkt, FALSE) - which chooses smaller polygon, ignoring polygon orientation, this works reasonably fast:

SELECT
  shape_key,
  square_key,
  SAFE_DIVIDE(
      (
        SELECT SUM(ST_LENGTH(ST_INTERSECTION(line, square_geography))) / 1000
        FROM UNNEST(shape_lines) AS line
      ),
      shape_length_km)
    AS square_portion
FROM
  `open-transport-data.public.shapes`,
  (select 
       square_key, 
       st_geogfromtext(st_astext(square_geography), FALSE) as square_geography,
     from `open-transport-data.public.squares`) squares
WHERE
  ST_INTERSECTS(shape_geography, square_geography)

这篇关于查询超时6小时后,如何优化?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆