如何加快表之间的差异? [英] How can I speed up a diff between tables?

查看:131
本文介绍了如何加快表之间的差异?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在做postgresql中的表之间的差异,需要很长时间,因为每个表是~13GB ...
我当前的查询是:

I am working on doing a diff between tables in postgresql, it takes a long time, as each table is ~13GB... My current query is:

SELECT * FROM tableA EXCEPT SELECT * FROM tableB;

SELECT * FROM tableB EXCEPT SELECT * FROM tableA;

当我在两张(未编制索引)的表上进行差异时,需要1:40小时(1小时) 40分钟)为了获得新的和删除的行,我需要运行查询两次,将总时间带到3:30小时。

When I do a diff on the two (unindexed) tables it takes 1:40 hours (1 hour and 40 minutes) In order to get both the new and removed rows I need to run the query twice, bringing the total time to 3:30 hours.

我跑了Postgresql EXPLAIN查询它,看看它在做什么。看起来它正在排序第一个表,然后是第二个表,然后比较它们。好吧,这让我觉得,如果我索引表格,他们将被预先排序,差异查询将更快。

I ran the Postgresql EXPLAIN query on it to see what it was doing. It looks like it is sorting the first table, then the second, then comparing them. Well that made me think that if I indexed the tables they would be presorted and the diff query would be much faster.

索引每个表花了45分钟。一旦索引,每个Diff需要1:35小时。
为什么索引只能在总差异时间内减少5分钟?我认为它会超过一半,因为在未编制索引的查询中,我将每个表排序两次(我需要运行两次查询)

Indexing each table took 45 minutes. Once Indexed, each Diff took 1:35 hours. Why do the indexes only shave off 5 minutes off the total diff time? I would assume that it would be more than half, since in the unindexed queries I am sorting each table twice (I need to run the query twice)

因为其中一个表格不会有太大变化,只需要索引一次,另一个会每天更新。所以索引方法的总运行时间为45分钟,加上差异为2x 1:35,总共3:55小时,差不多4小时。

Since one of these tables will not be changing much, it will only need to be indexed once, the other will be updated daily. So the total runtime for the indexed method is 45 minutes for the index, plus 2x 1:35 for the diff, giving a total of 3:55 hours, almost 4hours.

我在这里做错了什么,我不可能看到为什么索引我的净差异时间大于没有它?

What am I doing wrong here, I can't possibly see why with the index my net diff time is larger than without it?

这是对我的一点参考其他问题: Postgresql UNION需要10运行单个查询的次数

This is in slight reference to my other question here: Postgresql UNION takes 10 times as long as running the individual queries

编辑:
这是两个表的模式,除了表名之外它们是相同的。

Here is the schema for the two tables, they are identical except the table name.

CREATE TABLE bulk.blue
(
  "partA" text NOT NULL,
  "type" text NOT NULL,
  "partB" text NOT NULL
)
WITH (
  OIDS=FALSE
);


推荐答案

在上面的陈述中,您没有使用索引。

In the statements above you are not using the indexes.

你可以这样做:

SELECT * FROM tableA a
  FULL OUTER JOIN tableB b ON a.someID = b.someID

然后你可以使用显示哪些表缺少值的相同语句

You could then use the same statement to show which tables had missing values

SELECT * FROM tableA a
  FULL OUTER JOIN tableB b ON a.someID = b.someID
  WHERE ISNULL(a.someID) OR ISNULL(b.someID)

这应该为您提供表A或表B

This should give you the rows that were missing in table A OR table B

这篇关于如何加快表之间的差异?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆