Remove duplicates from table based on multiple criteria and persist to other table


Problem description


I have a taccounts table with columns like account_id (PK), login_name, password, last_login. Now I have to remove some duplicate entries according to new business logic. Duplicate accounts are those with either the same email or the same (login_name & password). The account with the latest login must be preserved.

Here are my attempts (some email values are null or blank):

DELETE
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 and last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 
GROUP BY lower(trim(both ' ' from email)))

Similarly for login_name and password

DELETE
FROM taccounts
WHERE last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
GROUP BY login_name, password)

Is there any better way or any way to combine these two separate queries?

Also, some other tables have account_id as a foreign key. How do I propagate this change to those tables? I am using PostgreSQL 9.2.1.

EDIT: Some of the email values are null and some of them are blank (''). So if two accounts have different login_name & password and their emails are null or blank, they must be considered two different accounts.

Solution

Luckily you are running PostgreSQL. DISTINCT ON should make this comparatively easy:

Since you are going to delete most of the rows (~90% dupes) and the table most probably fits into RAM easily, I went for this route:

  1. SELECT the surviving rows into a temporary table.
  2. Reroute referencing columns.
  3. DELETE all rows from the base table.
  4. Re-INSERT survivors.

Distill remaining rows

CREATE TEMP TABLE tmp AS
SELECT DISTINCT ON (login_name, password) *
FROM  (
   SELECT DISTINCT ON (email) *
   FROM   taccounts
   ORDER  BY email, last_login DESC
   ) sub
ORDER  BY login_name, password, last_login DESC;

More about DISTINCT ON:

To remove duplicates for two different criteria I just use a subquery to apply the two rules one after the other. The first step preserves the account with the latest last_login, so this is "serializable".
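To see how the two DISTINCT ON passes interact, here is a minimal sketch on a throwaway table. The table name demo_accounts and its rows are made up for illustration; only the column names follow the question.

```sql
-- Hypothetical sample data; column names follow the question.
CREATE TEMP TABLE demo_accounts (
   account_id int PRIMARY KEY
 , email      text
 , login_name text
 , password   text
 , last_login timestamp
);

INSERT INTO demo_accounts VALUES
   (1, 'a@example.com', 'ann', 'pw1', '2013-01-01')
 , (2, 'a@example.com', 'bob', 'pw2', '2013-02-01')  -- same email as 1, newer
 , (3, 'c@example.com', 'bob', 'pw2', '2013-03-01')  -- same login as 2, newer
 , (4, 'd@example.com', 'dan', 'pw4', '2013-01-15'); -- no duplicate

SELECT DISTINCT ON (login_name, password) *
FROM  (
   SELECT DISTINCT ON (email) *                   -- pass 1: newest row per email
   FROM   demo_accounts
   ORDER  BY email, last_login DESC
   ) sub
ORDER  BY login_name, password, last_login DESC;  -- pass 2: newest row per login
```

With these rows, accounts 3 and 4 should survive: account 1 loses to 2 on email, and 2 then loses to 3 on (login_name, password). Note that DISTINCT treats NULL values as equal, so all rows with a NULL email would collapse into a single row in pass 1, which matters for the updated definition of "duplicates" in the question.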

Inspect results and test for plausibility.

SELECT * FROM tmp;

A temporary table is dropped automatically at the end of a session. In pgAdmin (which you seem to be using) the session lives as long as the editor window in which you created the temporary table stays open.

Alternative query for updated definition of "duplicates"

SELECT *
FROM   taccounts t
WHERE  NOT EXISTS (
   SELECT 1
   FROM   taccounts t1
   WHERE (
           NULLIF(t1.email, '') = t.email OR 
           (NULLIF(t1.login_name, ''), NULLIF(t1.password, ''))
         = (t.login_name, t.password)
         )
   AND   (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
   );

This doesn't treat NULL or empty string ('') as identical in any of the "duplicate" columns.

The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes could share the same last_login. I take the one with the bigger account_id in this case - which is unique, since it is the PK.
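The two building blocks of that query can be checked in isolation. The following is only a sketch with literal values chosen for illustration; it touches no tables.

```sql
-- Row-wise comparison: the second field only matters when the first ties.
SELECT (timestamp '2013-01-01', 2) > (timestamp '2013-01-01', 1) AS tie_broken_by_id    -- true
     , (timestamp '2013-01-01', 9) > (timestamp '2013-02-01', 1) AS older_login_loses;  -- false

-- NULLIF turns '' into NULL, and a comparison with NULL is never true,
-- so blank or NULL values never count as a match.
SELECT NULLIF('', '') IS NULL        AS blank_becomes_null      -- true
     , (NULLIF('', '') = '') IS NULL AS comparison_is_unknown;  -- true
```

Because a row can only be "beaten" by another row that compares strictly greater on (last_login, account_id), exactly one row per group of duplicates has no better rival and is returned by the NOT EXISTS query.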

How to identify all incoming FKs

SELECT c.conrelid::regclass::text  AS referencing_table   -- table that holds the FK
      ,a.attname                   AS referencing_column  -- first column of the FK
      ,c.confrelid::regclass::text AS referenced_table
      ,c.conname AS fk_name
      ,pg_get_constraintdef(c.oid) AS fk_definition
FROM   pg_attribute a 
JOIN   pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
WHERE  c.confrelid = 'taccounts'::regclass   -- (schema-qualified) table name
AND    c.contype  = 'f'
ORDER  BY referencing_table, fk_name;

This only builds on the first column of each foreign key (c.conkey[1]), so a multi-column foreign key is matched by its first column only.

Or you can inspect the Dependents tab in the right-hand window of the object browser of pgAdmin, after selecting taccounts.

Reroute to new master

If you have tables referencing taccounts (incoming foreign keys to taccounts) you will want to update all those fields, before you delete the dupes.
Reroute all of them to the new master row:

-- referencing_tbl / referencing_column are placeholders for each incoming FK.
UPDATE referencing_tbl r
SET    referencing_column = tmp.account_id            -- point at the surviving account
FROM   tmp
JOIN   taccounts t1 USING (email)                     -- every account sharing the survivor's email
WHERE  r.referencing_column = t1.account_id
AND    r.referencing_column IS DISTINCT FROM tmp.account_id;

UPDATE referencing_tbl r
SET    referencing_column = tmp.account_id
FROM   tmp
JOIN   taccounts t2 USING (login_name, password)      -- every account sharing the survivor's login
WHERE  r.referencing_column = t2.account_id
AND    r.referencing_column IS DISTINCT FROM tmp.account_id;
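Before running the updates, it may help to preview what the first one would change. The sketch below assumes a hypothetical referencing table torders with a column account_id that references taccounts(account_id); substitute the tables and columns reported by the catalog query above.

```sql
-- Preview for the email rule: which references would move, and to which survivor.
-- torders is a hypothetical referencing table.
SELECT r.account_id   AS old_account_id
     , tmp.account_id AS new_account_id
     , count(*)       AS rows_to_reroute
FROM   torders   r
JOIN   taccounts t1 ON t1.account_id = r.account_id   -- account currently referenced
JOIN   tmp          ON tmp.email = t1.email           -- survivor with the same email
WHERE  r.account_id <> tmp.account_id                 -- skip rows already on the survivor
GROUP  BY 1, 2
ORDER  BY 1;
```

The same preview works for the second rule by joining tmp on (login_name, password) instead of email.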

Go in for the kill

Now, the dupes have no more links to them. Go in for the kill.

ALTER TABLE taccounts DISABLE TRIGGER ALL;
DELETE FROM taccounts;
VACUUM taccounts;
INSERT INTO taccounts
SELECT * FROM tmp;
ALTER TABLE taccounts ENABLE TRIGGER ALL;

I disable all triggers for the duration of the operation. This avoids checking for referential integrity during the operation. Everything should be fine, once you re-activate triggers. We took care of all incoming FKs above. Outgoing FKs are guaranteed to be sound, since you have no concurrent access and all values have been there before.
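As far as I know, re-enabling the triggers does not re-check rows that were changed while they were disabled, so a manual sanity check afterwards is cheap insurance. A minimal sketch, again assuming a hypothetical referencing table torders(account_id):

```sql
-- Any row returned here references an account that no longer exists.
SELECT r.*
FROM   torders r
WHERE  r.account_id IS NOT NULL
AND    NOT EXISTS (
   SELECT 1
   FROM   taccounts t
   WHERE  t.account_id = r.account_id
   );
```

An empty result for every referencing table suggests the reroute step covered all incoming foreign keys.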
