PostgreSQL: best way to join small subsets of large tables

Question

I am using PostgreSQL 9.3 on a Windows machine. Let's say I have a clients table with 4 million records and an orders table with 25 million records. However, I am interested in my New York clients only. There are only 5,000 New York clients who placed 15,000 orders, i.e. a drastically smaller subset.

What is the best way to retrieve the client ids and the total number of orders ever placed by New York clients?

Is a correlated subquery like this:

SELECT c.clientid,
       (SELECT count(orders.clientid)
        FROM orders
        WHERE orders.clientid = c.clientid) AS NumOrders
FROM clients c
WHERE c.city = 'New York';

faster than a join like this:

SELECT c.clientid,
       coalesce(o.NumOrders, 0) AS NumOrders
FROM clients c
LEFT OUTER JOIN (
    SELECT clientid, count(*) AS NumOrders
    FROM orders
    GROUP BY clientid
) o ON c.clientid = o.clientid
WHERE c.city = 'New York';

because the latter spends most of the time counting records which are then discarded since they don't relate to New York clients? Or is there a better way?

Thanks!

PS Yes, I know, I should look at the execution plan, but I am writing this from home and I don't have a database with millions of records to test this on.

Recommended answer

As you alluded to, the only way to truly know is to compare the execution plans. In fact, the best way would be to use EXPLAIN ANALYZE, which actually executes the query and reports the actual row counts and timings alongside the planner's estimates, so you can get a sense of the query planner versus reality.
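As a rough sketch of what that looks like for the correlated-subquery version from the question (table and column names are taken from the question; the BUFFERS option is an optional extra that is not part of the original answer):

```sql
-- Executes the query for real and prints actual times and row counts
-- next to the planner's estimates for each plan node.
-- BUFFERS additionally reports buffer hits and reads.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.clientid,
       (SELECT count(*)
        FROM orders o
        WHERE o.clientid = c.clientid) AS numorders
FROM clients c
WHERE c.city = 'New York';
```

Running the same command on the join version and comparing the two outputs shows where each plan spends its time.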

However, in general, what I would do in a situation like this would probably be to create a temp table for client subset and then JOIN that to the orders table. You could optionally use WITH instead to do everything in one query.

So, something like:

CREATE TEMP TABLE tmp_clients AS
SELECT c.clientid
FROM clients c
WHERE c.city = 'New York'
ORDER BY c.clientid;

SELECT *
FROM orders AS o
JOIN tmp_clients AS c ON (o.clientid = c.clientid)
ORDER BY o.clientid;

This way, tmp_clients contains only the New York clients -- ~5K rows -- and it's that table that will be joined to the orders table.
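The WITH variant mentioned earlier might look roughly like the following. This is only a sketch: the CTE name ny_clients is made up for illustration, and the query returns the per-client order count the question asks for rather than the raw joined rows.

```sql
-- ny_clients is a hypothetical name for the small client subset.
WITH ny_clients AS (
    SELECT c.clientid
    FROM clients c
    WHERE c.city = 'New York'
)
SELECT n.clientid,
       count(o.clientid) AS numorders  -- counts only matched orders, so 0 for clients with none
FROM ny_clients n
LEFT JOIN orders o ON o.clientid = n.clientid
GROUP BY n.clientid
ORDER BY n.clientid;
```

On PostgreSQL 9.3 a CTE is evaluated on its own and acts as an optimization fence, so it behaves much like the temp table here, though without the option of adding an index or running ANALYZE on it.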

You could also, to optimize further, create an index on the temp table (on the clientid) and then ANALYZE it before doing the JOIN, so that the planner has up-to-date statistics for it (autovacuum does not process temp tables) and can consider an index-based JOIN. You'd want to check the query plans in each case to see the relative difference (or just keep this in mind if the JOIN isn't quite as fast as you would like).
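A minimal sketch of those extra steps, assuming the tmp_clients table created above (the index name is invented for illustration):

```sql
-- Index the join column of the temp table (hypothetical index name).
CREATE INDEX tmp_clients_clientid_idx ON tmp_clients (clientid);

-- Collect statistics by hand so the planner knows the table is small.
ANALYZE tmp_clients;

-- Check what the planner actually does with the join.
EXPLAIN ANALYZE
SELECT c.clientid,
       count(o.clientid) AS numorders
FROM tmp_clients c
LEFT JOIN orders o ON o.clientid = c.clientid
GROUP BY c.clientid;
```

Whether the planner uses the index at all depends on the statistics and on which indexes exist on orders, which the question does not state, so the plan output is the thing to look at.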

In response to the comment by @poshest:

That sounds like the temp tables are stacking up, which would increase the memory footprint and, for a long-running connection, would functionally appear to be a memory leak.

In that case, it wouldn't be a true leak, though, as temp tables are scoped to a connection. They disappear automatically, but not until after the connection ends. However, you can make them disappear right away when you're done with them. Simply DROP the table as you would any other once you're done with it, and I suspect you'll be able to call the function a bunch of times -- on the same connection -- without the same sort of monotonic memory footprint increase.
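For example, the cleanup could look like one of the following. The second form, using ON COMMIT DROP inside a transaction, is an alternative that the answer itself does not mention.

```sql
-- Option 1: drop the temp table explicitly once the results have been read.
DROP TABLE IF EXISTS tmp_clients;

-- Option 2: tie the table's lifetime to a transaction.
BEGIN;

CREATE TEMP TABLE tmp_clients ON COMMIT DROP AS
SELECT c.clientid
FROM clients c
WHERE c.city = 'New York';

-- ... run the queries that use tmp_clients here ...

COMMIT;  -- tmp_clients is dropped automatically at this point
```

Either way, the table no longer lingers for the rest of a long-running connection.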
