How to tell if record has changed in Postgres


Question


I have a bit of an "upsert" type of question... but, I want to throw it out there because it's a little bit different than any that I've read on stackoverflow.

Basic problem.

I'm working on moving from mysql to PostgreSQL 9.1.5 (hosted on Heroku). As a part of that, I need to import multiple CSV files every day. Some of the data is sales information and is almost guaranteed to be new and in need of insertion. But other parts of the data are almost guaranteed to be the same. For example, the csv files (note plural) will have POS (point of sale) information in them. This rarely changes (and most likely only via additions). Then there is product information. There are about 10,000 products (the vast majority will be unchanged, but it's possible to have both additions and updates).

The final item (but an important one) is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace that back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
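
To make that requirement concrete, one common way to model this kind of provenance (all table and column names here are hypothetical, purely for illustration) is an imports table that each target row points back to:

CREATE TABLE imports (
    import_id   serial PRIMARY KEY,
    file_name   text NOT NULL,
    imported_at timestamptz NOT NULL DEFAULT now()
);

-- each target table then records which import last touched each row
ALTER TABLE products ADD COLUMN last_import_id integer REFERENCES imports;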

Solution that I'm contemplating.

Since the data is provided to me via CSV, I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. the final destination). So, I'm copying them into tables in a staging schema that match the CSV (note: one schema per data source). The tables in the staging schemas will have before-insert row triggers. These triggers can decide what to do with the data (insert, update, or ignore).
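
Roughly what I have in mind, with hypothetical names (one staging schema, one staging table shaped like the CSV, and a before-insert row trigger that routes each row):

CREATE SCHEMA staging_pos;

-- staging table shaped like the CSV, not like the final table
CREATE TABLE staging_pos.products (
    upc         text,
    description text
);

CREATE OR REPLACE FUNCTION staging_pos.route_product()
RETURNS trigger AS $$
BEGIN
    IF EXISTS (SELECT 1 FROM public.products p WHERE p.upc = NEW.upc) THEN
        RETURN NULL;  -- already known: a BEFORE trigger returning NULL drops the row
    END IF;
    INSERT INTO public.products (upc, description)
    VALUES (NEW.upc, NEW.description);
    RETURN NULL;  -- handled: keep nothing in the staging table itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER route_product
BEFORE INSERT ON staging_pos.products
FOR EACH ROW EXECUTE PROCEDURE staging_pos.route_product();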

For the tables that are most likely to contain new data, it will try to insert first. If the record is already there, then the trigger will return NULL (and stop the insert into the staging table). For tables that rarely change, it will query the table and see if the record is found. If it is, then I need a way to see if any of the fields have changed (because remember, I need to show that the record was modified by import x from file y). I obviously could just boilerplate out the code and test each column, but I was looking for something a little more "eloquent" and more maintainable than that.

In a way, what I'm kind of doing is combining an importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like hstore might be a nice way of getting changes (and being able to easily ignore some columns in the table that aren't important - e.g. "last_modified").
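
A sketch of the hstore idea, again using the hypothetical product tables from above (this assumes the stored table and the staging table share column names):

CREATE EXTENSION hstore;  -- the comparison below needs the hstore module

-- hstore(a) - hstore(b) keeps only the pairs of a that differ from b;
-- subtracting a text[] then removes columns we don't care about.
SELECT p.upc,
       (hstore(s.*) - hstore(p.*)) - ARRAY['last_modified'] AS changed_fields
FROM   public.products p
JOIN   staging_pos.products s USING (upc)
WHERE  (hstore(s.*) - hstore(p.*)) - ARRAY['last_modified'] <> ''::hstore;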

I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.

My question?

Is there a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database? I could certainly write a python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and will lead to lots of round trips.

A few final things:

  1. I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
  2. The system is growing, and new data sources are likely to be added that will greatly increase the amount of data being processed (so I'm trying to keep things efficient)
  3. I know this is not a nice, simple SO question (like "how to sort a list in python"), but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts about what they think is the best way to solve it.

Solution

I have lots of similar operations. What I do is COPY to temporary staging tables:

CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0;  -- only copy structure, no data

COPY target_tmp FROM '/path/to/target.csv';
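
If the file is actual CSV (quoting, delimiters, a header row), COPY typically needs format options, e.g.:

COPY target_tmp FROM '/path/to/target.csv' WITH (FORMAT csv, HEADER);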

For performance, run ANALYZE; temporary tables are not analyzed by autovacuum!

ANALYZE target_tmp; 

Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.

ALTER TABLE target_tmp ADD CONSTRAINT target_tmp_pkey PRIMARY KEY (target_id);

You don't need the performance stuff for small imports.

Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..

Maybe DELETE what isn't there any more?

DELETE FROM target_tbl t
WHERE NOT EXISTS (
   SELECT 1 FROM target_tmp t1
   WHERE  t1.target_id = t.target_id
);

Then UPDATE what's already there:

UPDATE target_tbl t
SET    col1 = t1.col1
FROM   target_tmp t1
WHERE  t.target_id = t1.target_id

To avoid empty UPDATEs, simply add:

...
AND    col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns

Or, if the whole row is relevant:

...
AND    t IS DISTINCT FROM t1;         -- check the whole row
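
Put together, the guarded UPDATE would read (assuming col1 is the only relevant payload column):

UPDATE target_tbl t
SET    col1 = t1.col1
FROM   target_tmp t1
WHERE  t.target_id = t1.target_id
AND    t.col1 IS DISTINCT FROM t1.col1;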

Then INSERT what's new:

INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM   target_tmp t1
LEFT   JOIN target_tbl t USING (target_id)
WHERE  t.target_id IS NULL;
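
As an aside beyond the original answer: on PostgreSQL 9.5 and later (not the 9.1 release in the question), the UPDATE and INSERT steps can be collapsed into a single INSERT ... ON CONFLICT statement:

INSERT INTO target_tbl (target_id, col1)
SELECT target_id, col1
FROM   target_tmp
ON CONFLICT (target_id) DO UPDATE
SET    col1 = EXCLUDED.col1
WHERE  target_tbl.col1 IS DISTINCT FROM EXCLUDED.col1;  -- skip no-op updates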

Clean up if your session goes on (temp tables are dropped at end of session automatically):

DROP TABLE target_tmp;

Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
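
For example:

CREATE TEMP TABLE target_tmp ON COMMIT DROP AS
SELECT * FROM target_tbl LIMIT 0;  -- same structure copy, dropped at commit
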
Code untested, but should work in any modern version of PostgreSQL except for typos.
