Postgresql - 插入记录的清洁方法,如果他们不存在,更新如果他们这样做 [英] Postgresql - Clean way to insert records if they don't exist, update if they do

查看:134
本文介绍了Postgresql - 插入记录的清洁方法,如果他们不存在,更新如果他们这样做的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这里是我的情况。我有一个表有一堆的URL和抓取
日期与他们相关联。当我的程序处理一个URL时,我想要
插入一个带有抓取日期的新行。如果URL已存在,I
要将抓取日期更新为当前日期时间。使用MS SQL或
Oracle我可能使用MERGE命令。使用mySQL我可以使用ON DUPLICATE KEY UPDATE语法。



我可以在我的程序中执行多个查询,可能不是
线程安全。我可以写一个SQL函数,它有各种IF ... ELSE
逻辑。但是,为了尝试Postgres功能,我以前从未使用过
,我正在考虑创建一个INSERT规则 -
这样:

  CREATE RULE Pages_Upsert AS ON INSERT TO Pages 
WHERE EXISTS(SELECT 1 from Pages P where NEW.Url = P.Url)
DO INSTEAD
UPDATE Pages SET LastCrawled = NOW(),Html = NEW.Html WHERE Url = NEW.Url;

这似乎实际上很好。它可能在
的代码可读性的立场上失去一些观点,因为有人看着我的代码
第一次将不得不神奇地知道这个规则,但我
猜测,可以是解决了良好的代码注释和
文档。



这个想法还有其他缺点,或者你的想法
糟糕,你应该做它/这/方式评论?

UPDATE :因为有人想要查询计划:)

 Insert(cost = 2.79..2.81 rows = 1 width = 0)
InitPlan 1(返回$ 0)
- > Seq在页面p上扫描(cost = 0.00..2.79 rows = 1 width = 0)
过滤器:('http://www.foo.com':: text = lower((url):: text))
- >结果(cost = 0.00..0.01 rows = 1 width = 0)
一次性过滤器: )

Update(cost = 2.79..5.46 rows = 1 width = 111)
InitPlan 1(返回$ 0)
- &扫描页面p(cost = 0.00..2.79 rows = 1 width = 0)
过滤器:('http://www.foo.com':: text = lower((url):: text ))
- >结果(cost = 0.00..2.67 rows = 1 width = 111)
一次性过滤器:$ 0
- & pages(cost = 0.00..2.66 rows = 1 width = 111)
过滤器:((url):: text ='http://www.foo.com':: text)


解决方案

好,我设法创建了一个测试用例。结果是更新部分总是被执行,即使是在新的插入。 COPY似乎绕过规则系统。
[为了清楚起见,我把这个单独的回复]

  DROP TABLE pages CASCADE; 
CREATE TABLE pages
(url VARCHAR NOT NULL PRIMARY KEY
,html VARCHAR
,last TIMESTAMP
);

INSERT INTO pages(url,html,last)VALUES('www.example.com://page1','meuk1','2001-09-18 23:30:00'::时间戳);

CREATE RULE Pages_Upsert AS ON INSERT TO pages
WHERE EXISTS(SELECT 1 from pages P where NEW.url = P.url)
DO INSTEAD(
UPDATE pages SET html = new.html,last = NOW()WHERE url = NEW.url
);

INSERT INTO pages(url,html,last)VALUES('www.example.com://page2','meuk2','2002-09-18 23:30:00'::时间戳);
INSERT INTO pages(url,html,last)VALUES('www.example.com://page3','meuk3','2003-09-18 23:30:00':: timestamp);

INSERT INTO pages(url,html,last)SELECT pp.url || '/ added':: text,pp.html || '.html':: text,pp.last + interval '20 years'FROM pages pp;

COPY页面(url,html,last)FROM STDIN;
www.example.com://pageX stdin 2000-09-18 23:30:00
\。

SELECT * FROM pages;

结果:

  url | html | last 
------------------------------- + ------------ + ----------------------------
www.example.com://page1 | meuk1 | 2001-09-18 23:30:00
www.example.com://page2 | meuk2 | 2011-09-18 23:48:30.775373
www.example.com://page3 | meuk3 | 2011-09-18 23:48:30.783758
www.example.com://page1/added | meuk1.html | 2011-09-18 23:48:30.792097
www.example.com://page2/added | meuk2.html | 2011-09-18 23:48:30.792097
www.example.com://page3/added | meuk3.html | 2011-09-18 23:48:30.792097
www.example.com://pageX | stdin | 2000-09-18 23:30:00
(7 rows)

更新:证明它可以完成:

  INSERT INTO页面(url,html,last)VALUES('www.example.com: // page1','meuk1','2001-09-18 23:30:00':: timestamp); 
CREATE VIEW vpages AS(SELECT * from pages);

CREATE RULE Pages_Upsert AS ON INSERT TO vpages
DO INSTEAD(
UPDATE pages p0
SET html = NEW.html,last = NOW()WHERE p0.url = NEW.url
;
INSERT INTO pages(url,html,last)
SELECT NEW.url,NEW.html,NEW.last
WHERE NOT EXISTS(SELECT * FROM pages p1 WHERE p1.url = NEW.url)
);

CREATE RULE Pages_Indate AS ON UPDATE TO vpages
DO INSTEAD(
INSERT INTO pages(url,html,last)
SELECT NEW.url,NEW.html, NEW.last
WHERE NOT EXISTS(SELECT * FROM pages p1 WHERE p1.url = OLD.url)
;
UPDATE pages p0
SET html = NEW.html,last = NEW.last WHERE p0.url = NEW.url
;
);

INSERT INTO vpages(url,html,last)VALUES('www.example.com://page2','meuk2','2002-09-18 23:30:00'::时间戳);
INSERT INTO vpages(url,html,last)VALUES('www.example.com://page3','meuk3','2003-09-18 23:30:00':: timestamp);

INSERT INTO vpages(url,html,last)SELECT pp.url || '/ added':: text,pp.html || '.html':: text,pp.last + interval '20 years'from vpages pp;
UPDATE vpages SET last = last + interval'-10 years'WHERE url ='www.example.com://page1';

- 复制不适用于视图
- COPY vpages(url,html,last)FROM STDIN;
- www.example.com://pageX stdin 2000-09-18 23:30:00
- \。

SELECT * FROM vpages;

结果:

 code> INSERT 0 1 
INSERT 0 1
INSERT 0 3
更新1
url | html | last
------------------------------- + ------------ + ---------------------
www.example.com://page2 | meuk2 | 2002-09-18 23:30:00
www.example.com://page3 | meuk3 | 2003-09-18 23:30:00
www.example.com://page1/added | meuk1.html | 2021-09-18 23:30:00
www.example.com://page2/added | meuk2.html | 2022-09-18 23:30:00
www.example.com://page3/added | meuk3.html | 2023-09-18 23:30:00
www.example.com://page1 | meuk1 | 1991-09-18 23:30:00
(6 rows)

必须防止重写系统进入递归。
构造一个DELETE规则作为练习留给读者。


Here's my situation. I have a table with a bunch of URLs and crawl dates associated with them. When my program processes a URL, I want to INSERT a new row with a crawl date. If the URL already exists, I want to update the crawl date to the current datetime. With MS SQL or Oracle I'd probably use a MERGE command for this. With mySQL I'd probably use the ON DUPLICATE KEY UPDATE syntax.

I could do multiple queries in my program, which may or may not be thread safe. I could write a SQL function which has various IF...ELSE logic. However, for the sake of trying out Postgres features I've never used before, I'm thinking about creating an INSERT rule - something like this:

CREATE RULE Pages_Upsert AS ON INSERT TO Pages
  WHERE EXISTS (SELECT 1 from Pages P where NEW.Url = P.Url)
  DO INSTEAD
     UPDATE Pages SET LastCrawled = NOW(), Html = NEW.Html WHERE Url = NEW.Url;

This seems to actually work great. It probably loses some points on the "code readability" standpoint, as someone looking at my code for the first time would have to magically know about this rule, but I guess that could be solved with good code commenting and documentation.

Are there any other drawbacks to this idea, or maybe a "your idea sucks, you should do it /this/ way instead" comment? I'm on PG 9.0 if that matters.

UPDATE: Query plan since someone wanted it :)

"Insert  (cost=2.79..2.81 rows=1 width=0)"
"  InitPlan 1 (returns $0)"
"    ->  Seq Scan on pages p  (cost=0.00..2.79 rows=1 width=0)"
"          Filter: ('http://www.foo.com'::text = lower((url)::text))"
"  ->  Result  (cost=0.00..0.01 rows=1 width=0)"
"        One-Time Filter: ($0 IS NOT TRUE)"
""
"Update  (cost=2.79..5.46 rows=1 width=111)"
"  InitPlan 1 (returns $0)"
"    ->  Seq Scan on pages p  (cost=0.00..2.79 rows=1 width=0)"
"          Filter: ('http://www.foo.com'::text = lower((url)::text))"
"  ->  Result  (cost=0.00..2.67 rows=1 width=111)"
"        One-Time Filter: $0"
"        ->  Seq Scan on pages  (cost=0.00..2.66 rows=1 width=111)"
"              Filter: ((url)::text = 'http://www.foo.com'::text)"

解决方案

Ok, I managed to create a testcase. The result is that the update part is always executed, even on a fresh insert. COPY seems to bypass the rule system. [For clarity I have put this into a separate reply]

DROP TABLE pages CASCADE;
CREATE TABLE pages
    ( url VARCHAR NOT NULL  PRIMARY KEY
    , html VARCHAR
    , last TIMESTAMP
    );

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );

CREATE RULE Pages_Upsert AS ON INSERT TO pages
  WHERE EXISTS (SELECT 1 from pages P where NEW.url = P.url)
     DO INSTEAD (
     UPDATE pages SET html=new.html , last = NOW() WHERE url = NEW.url
    );

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00':: timestamp );
INSERT INTO pages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00':: timestamp );

INSERT INTO pages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM pages pp;

COPY pages(url,html,last) FROM STDIN;
www.example.com://pageX     stdin   2000-09-18 23:30:00
\.

SELECT * FROM pages;

The result:

              url              |    html    |            last            
-------------------------------+------------+----------------------------
 www.example.com://page1       | meuk1      | 2001-09-18 23:30:00
 www.example.com://page2       | meuk2      | 2011-09-18 23:48:30.775373
 www.example.com://page3       | meuk3      | 2011-09-18 23:48:30.783758
 www.example.com://page1/added | meuk1.html | 2011-09-18 23:48:30.792097
 www.example.com://page2/added | meuk2.html | 2011-09-18 23:48:30.792097
 www.example.com://page3/added | meuk3.html | 2011-09-18 23:48:30.792097
 www.example.com://pageX       | stdin      | 2000-09-18 23:30:00
 (7 rows)

UPDATE: Just to prove it can be done:

INSERT INTO pages(url,html,last) VALUES ('www.example.com://page1' , 'meuk1' , '2001-09-18 23:30:00'::timestamp );
CREATE VIEW vpages AS (SELECT * from pages);

CREATE RULE Pages_Upsert AS ON INSERT TO vpages
  DO INSTEAD (
     UPDATE pages p0
     SET html=NEW.html , last = NOW() WHERE p0.url = NEW.url
    ;
     INSERT INTO pages (url,html,last)
    SELECT NEW.url, NEW.html, NEW.last
        WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = NEW.url)
    );

CREATE RULE Pages_Indate AS ON UPDATE TO vpages
  DO INSTEAD (
     INSERT INTO pages (url,html,last)
    SELECT NEW.url, NEW.html, NEW.last
        WHERE NOT EXISTS ( SELECT * FROM pages p1 WHERE p1.url = OLD.url)
        ;
     UPDATE pages p0
     SET html=NEW.html , last = NEW.last WHERE p0.url = NEW.url
        ;
    );

INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page2' , 'meuk2' , '2002-09-18 23:30:00':: timestamp );
INSERT INTO vpages(url,html,last) VALUES ('www.example.com://page3' , 'meuk3' , '2003-09-18 23:30:00':: timestamp );

INSERT INTO vpages(url,html,last) SELECT pp.url || '/added'::text, pp.html || '.html'::text , pp.last + interval '20 years' FROM vpages pp;
UPDATE vpages SET last = last + interval '-10 years' WHERE url = 'www.example.com://page1' ;

-- Copy does NOT work on views
-- COPY vpages(url,html,last) FROM STDIN;
-- www.example.com://pageX    stdin    2000-09-18 23:30:00
-- \.

SELECT * FROM vpages;

Result:

INSERT 0 1
INSERT 0 1
INSERT 0 3
UPDATE 1
              url              |    html    |        last         
-------------------------------+------------+---------------------
 www.example.com://page2       | meuk2      | 2002-09-18 23:30:00
 www.example.com://page3       | meuk3      | 2003-09-18 23:30:00
 www.example.com://page1/added | meuk1.html | 2021-09-18 23:30:00
 www.example.com://page2/added | meuk2.html | 2022-09-18 23:30:00
 www.example.com://page3/added | meuk3.html | 2023-09-18 23:30:00
 www.example.com://page1       | meuk1      | 1991-09-18 23:30:00
(6 rows)

The view is necessary to prevent the rewrite system to go into recursion. Construction of a DELETE rule is left as an exercise to the reader.

这篇关于Postgresql - 插入记录的清洁方法,如果他们不存在,更新如果他们这样做的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆