How do I write a join with this unusual matching criteria?


Question

I want to "left join" a table so that a value is joined not just to a matching row, but also to any subsequent non-matching rows, up to the next matching row. To put it another way, I want to fill in nulls with the previous non-null value.

Sample data and desired result:

Table x:

 id 
----
  1
  2
  3
  4
  5

Table y:

 id | val 
----+-----
  1 | a
  4 | b

Result of a plain left join (select x.id, y.val from x left join y on x.id = y.id order by x.id):

 id | val 
----+-----
  1 | a
  2 | 
  3 | 
  4 | b
  5 | 

Desired result:

 id | val 
----+-----
  1 | a
  2 | a
  3 | a
  4 | b
  5 | b
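
To reproduce the sample data above, a minimal setup might look like the following. The column types (integer, text) and the primary keys are assumptions; the tables in the question are only shown as query output.

```sql
-- Assumed schema: the question only shows the data, not the DDL.
CREATE TABLE x (id integer PRIMARY KEY);
CREATE TABLE y (id integer PRIMARY KEY, val text);

-- Sample rows exactly as listed in the question.
INSERT INTO x (id) VALUES (1), (2), (3), (4), (5);
INSERT INTO y (id, val) VALUES (1, 'a'), (4, 'b');

-- The plain left join from the question: val is NULL for ids 2, 3 and 5.
SELECT x.id, y.val
FROM   x
LEFT   JOIN y ON x.id = y.id
ORDER  BY x.id;
```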

Solution

Indices

Create indices on x.id and y.id, which you probably already have if those are your primary keys.
A multi-column index may help, too, especially with index-only scans in PostgreSQL 9.2+:

CREATE INDEX y_mult_idx ON y (id DESC, val);

However, in my tests, this index was not used at first. I had to add the (otherwise pointless) column val to ORDER BY to convince the query planner that the sort order matches. See query 3.

The index makes little difference in this synthetic setup. But for tables with more columns, retrieving val from the table becomes increasingly expensive, making the "covering" index more attractive.

Queries

1) Simple

SELECT DISTINCT ON (x.id)
       x.id, y.val
FROM   x
JOIN   y ON y.id <= x.id
ORDER  BY x.id, y.id DESC;

SQL Fiddle.

The DISTINCT ON technique is explained in more detail in a related answer.

I ran some tests because I suspected that the first query would not scale well. It is fast with a small table, but slow with bigger tables. Postgres does not optimize the plan and starts with a (limited) cross join, at a cost of O(N²).
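
A side note on query 1: because it uses an inner join, any row in x whose id is smaller than the smallest y.id has no partner and drops out of the result. That does not happen with the sample data, where both tables start at id 1. If such rows should be kept with a NULL val, as the "left join" in the question suggests, a variant along these lines might do. This is a sketch, not part of the original answer.

```sql
-- Variant of query 1 that keeps rows of x with no earlier match in y.
-- For those rows the join yields a single row with y.id and y.val NULL.
SELECT DISTINCT ON (x.id)
       x.id, y.val
FROM   x
LEFT   JOIN y ON y.id <= x.id
ORDER  BY x.id, y.id DESC;  -- per x.id, the greatest matching y.id sorts first
```

The scaling concern described above applies to this variant as well, since the join condition is unchanged.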

2) Fast

This query is still rather simple and scales excellently:

SELECT x.id, y.val
FROM   x
JOIN  (SELECT *, lead(id, 1, 2147483647) OVER (ORDER BY id) AS next_id FROM y) y
       ON  x.id >= y.id
       AND x.id <  y.next_id
ORDER  BY 1;

The window function lead() is instrumental. I use its optional third argument to provide a default that covers the corner case of the last row, which has no next row: 2147483647 is the largest value of a 4-byte integer. Adapt it to your data type.
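
To see what the join in query 2 operates on, the subquery can be run on its own. With the sample data from the question, each row of y gets the id of the following row as the exclusive upper bound of its range. The second statement is an assumed alternative that avoids the magic number by leaving the default as NULL.

```sql
-- The derived table from query 2 in isolation.
SELECT id, val, lead(id, 1, 2147483647) OVER (ORDER BY id) AS next_id
FROM   y
ORDER  BY id;
--  id | val |  next_id
-- ----+-----+------------
--   1 | a   |          4
--   4 | b   | 2147483647

-- Alternative sketch: no default, so next_id is NULL for the last row,
-- and the join treats NULL as "no upper bound".
SELECT x.id, y.val
FROM   x
JOIN  (SELECT *, lead(id) OVER (ORDER BY id) AS next_id FROM y) y
       ON  x.id >= y.id
       AND (y.next_id IS NULL OR x.id < y.next_id)
ORDER  BY 1;
```

The NULL-based form works for any data type but adds an OR to the join condition, which may influence the plan the planner chooses.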

3) Very simple and almost as fast

SELECT x.id
     ,(SELECT val FROM y WHERE id <= x.id ORDER BY id DESC, val LIMIT 1) AS val
FROM   x;

Normally, correlated subqueries tend to be slow. But this one can just pick a value from the (covering) index and is otherwise so simple that it can compete.

The additional ORDER BY item val seems pointless. But adding it convinces the query planner that it is OK to use the multi-column index y_mult_idx from above, because the sort order matches. Note the line "Index Only Scan using y_mult_idx .." in the EXPLAIN output.
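
One way to check the plan yourself is sketched below. It assumes the tables and the index y_mult_idx exist as described above. Whether an index-only scan actually appears depends on table size, statistics, and the visibility map, so the output on a tiny table may differ.

```sql
-- Index-only scans rely on the visibility map, which VACUUM maintains;
-- ANALYZE refreshes the planner statistics.
VACUUM ANALYZE y;

-- EXPLAIN with ANALYZE executes the query and reports actual timings.
EXPLAIN (ANALYZE, BUFFERS)
SELECT x.id
     ,(SELECT val FROM y WHERE id <= x.id ORDER BY id DESC, val LIMIT 1) AS val
FROM   x;
```

Running the same EXPLAIN without val in the ORDER BY of the subquery shows whether the extra sort column makes a difference on your version.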

Test case

After a lively debate and multiple updates, I collected all queries posted so far and made a test case for a quick overview. I only use 1000 rows so that SQLfiddle does not time out with the slower queries. But the top 4 (Erwin 2, Clodoaldo, a_horse, Erwin 3) scale linearly in all my local tests. Updated once more to include my latest addition, improve the format, and order the queries by performance:
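
The original test data is only available through the fiddle. For local experiments, a comparable data set could be generated as follows. This is an assumed setup, not the author's test case: the 1000 rows match the figure mentioned above, while the one-in-ten density of y and the val strings are arbitrary choices.

```sql
-- Assumed schema, as in the question; created only if missing.
CREATE TABLE IF NOT EXISTS x (id integer PRIMARY KEY);
CREATE TABLE IF NOT EXISTS y (id integer PRIMARY KEY, val text);

-- Caution: this discards any existing rows in both tables.
TRUNCATE x, y;

-- 1000 consecutive ids in x; every tenth id (1, 11, 21, ...) gets a value in y.
INSERT INTO x (id)
SELECT g FROM generate_series(1, 1000) g;

INSERT INTO y (id, val)
SELECT g, 'val_' || g::text FROM generate_series(1, 1000, 10) g;

-- Refresh planner statistics before timing anything.
ANALYZE x;
ANALYZE y;
```

Raising the upper bound in both generate_series() calls is a simple way to observe how each query scales.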

Big SQL Fiddle comparing performance.
