Generate cluster field for a set of records which match the condition(s) using PostgreSQL

Views: 37 Published: 2022/2/27 21:08:28 Tags: sql postgresql group-by match


Question

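I have a table in PostgreSQL 9.4 with the following fields: ID (primary key), Customer_Name, Mobile, Email. The ID column is unique per record, but it does not necessarily identify a unique person. A customer can have multiple records with different names and/or different mobile numbers or emails, each linked to a unique ID.

I need a new computed column called Cluster_ID (produced by a SQL query) that uniquely identifies a customer based on a match on name, mobile, or email. That is, if any one of a record's name, mobile, or email matches that of another record, both records should be assigned the same Cluster_ID. The Cluster_ID should be unique to each set of matching records and should preferably be the same every time the query is executed.

I have created a sample DDL for Postgres (you can test it on SQLfiddle.com):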

CREATE TABLE Customer (
    ID  integer,
    Name varchar(30),
    Mobile  varchar(20),
    Email  varchar(50)
);

INSERT INTO Customer (ID, Name, Mobile, Email) VALUES
    (1, 'Tim', '9876728382', 'tim@email.com'),
    (2, 'John', '9845323453', 'john@email.com'),
    (3, 'Tim', '8265748319', 'toy@test.com'),
    (4, 'John Snow', '9845323453', NULL),
    (5, 'Timmothy', '8265748319', 'timmothy@somemail.com'),
    (6, 'John', '8345908112', 'JohnySnow@someemail.com'),
    (7, 'Tim M. Jacob', NULL, 'timmothy@somemail.com'),
    (8, 'John P. Snow', '8345908112', NULL),
    (9, 'Rack', '7654783949', 'racky@email.com'),
    (10, 'Racky Dsouza', '9934364837', 'racky@email.com'),
    (11, 'Rock M. Dsouza', '9934364837', 'rackguy@somemail.com'),
    (12, 'John Snowden', '8463865392', 'John@someemail.com')
;

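See the link below for the expected output of the SQL query. Note that I have highlighted (with a light yellow background) the values that match values in other records.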

https://docs.google.com/spreadsheets/d/1IjLfCuyKmizw0ywvDpGO_e08ATlSnlPr__UBWUsVCV0/pubhtml?gid=0&single=true

For a set of records that share a matching value in Name, Mobile, or Email, the assigned Cluster_ID should preferably be the same.

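Recommended answer

What you are actually trying to do is compute a partition of a set into disjoint sets.

One idea is to partition the table using a disjoint-set data structure and implement a find(Element) function, which, for a given table element (row), determines the representative of the disjoint set it belongs to.

For details, see this link: Disjoint-set data structure

A common approach is to pick one fixed element in each set, called its representative, to stand for the whole set. Then Find(x) returns the representative of the set that x belongs to.

Suppose we define the representative of a given disjoint subset as the minimum ID value among all rows in that subset. This representative value will be our cluster_id.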
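As a rough illustration of the representative idea (a minimal Python sketch, not part of the original answer; the class name `MinIdDisjointSet` and its methods are hypothetical), a disjoint-set structure can always keep the smallest ID of each set at the root, so that `find(x)` returns the value that would serve as the cluster_id:

```python
class MinIdDisjointSet:
    """Disjoint-set forest whose representative is the smallest ID in each set."""

    def __init__(self, ids):
        # Every ID starts out alone, as the representative of its own set.
        self.parent = {i: i for i in ids}

    def find(self, x):
        # Walk up to the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # Attach the larger root under the smaller one,
            # so the minimum ID always stays the representative.
            self.parent[max(ra, rb)] = min(ra, rb)


ds = MinIdDisjointSet(range(1, 8))
ds.union(5, 3)  # rows 3 and 5 share a mobile number
ds.union(7, 5)  # rows 5 and 7 share an email
ds.union(3, 1)  # rows 1 and 3 share a name
representative_of_5 = ds.find(5)
```

The three `union` calls mirror matches visible in the sample data; after them, rows 1, 3, 5 and 7 all report the same representative.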

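In that case, the find(X) function can be implemented using PostgreSQL WITH Queries (Common Table Expressions). The example below determines the representative of the disjoint subset that contains the row with id = 5: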

with recursive xxx( id, name, mobile, email ) AS (
    select * from customer where id = 5
    union
    select c.*
    from customer c
    join xxx x
      on c.name = x.name or c.mobile = x.mobile or c.email = x.email
)
select min(id) from xxx;

min |
----|
1   |
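To make the evaluation of this recursive query easier to follow, here is a hedged Python sketch of the same fixed-point expansion over the sample rows from the question. The names `CUSTOMERS`, `matches` and `find` are illustrative assumptions, and `None` stands in for SQL NULL:

```python
# Sample rows from the question: (id, name, mobile, email).
CUSTOMERS = [
    (1, "Tim", "9876728382", "tim@email.com"),
    (2, "John", "9845323453", "john@email.com"),
    (3, "Tim", "8265748319", "toy@test.com"),
    (4, "John Snow", "9845323453", None),
    (5, "Timmothy", "8265748319", "timmothy@somemail.com"),
    (6, "John", "8345908112", "JohnySnow@someemail.com"),
    (7, "Tim M. Jacob", None, "timmothy@somemail.com"),
    (8, "John P. Snow", "8345908112", None),
    (9, "Rack", "7654783949", "racky@email.com"),
    (10, "Racky Dsouza", "9934364837", "racky@email.com"),
    (11, "Rock M. Dsouza", "9934364837", "rackguy@somemail.com"),
    (12, "John Snowden", "8463865392", "John@someemail.com"),
]


def matches(a, b):
    # Mirrors the join condition: equal name, equal mobile, or equal email.
    # As with SQL NULL, a missing value (None) never matches anything.
    return any(x is not None and x == y for x, y in zip(a[1:], b[1:]))


def find(start_id, rows=CUSTOMERS):
    # Non-recursive term: seed the result with the starting row.
    reached = {r for r in rows if r[0] == start_id}
    frontier = set(reached)
    while frontier:
        # Recursive term: join the newly found rows against the whole table.
        # Like UNION, rows that were already reached are discarded.
        new_rows = {c for c in rows for x in frontier if matches(c, x)} - reached
        reached |= new_rows
        frontier = new_rows
    # The representative is the smallest id in the reached set.
    return min(r[0] for r in reached)


cluster_of_5 = find(5)
```

Starting from row 5, the expansion reaches rows 3 and 7 first (shared mobile and email), then row 1 through the name shared with row 3, and stops once no new rows appear.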

The query above can be used as a subquery to determine the set representative for every row in the table:

select q.*, (
    with recursive xxx( id, name, mobile, email ) AS (
        select * from customer where id = q.id
        union
        select c.*
        from customer c
        join xxx x
          on c.name = x.name or c.mobile = x.mobile or c.email = x.email
    )
    select min( id ) from xxx
) as cluster_id
from customer q
order by cluster_id, id;

id |name           |mobile     |email                   |cluster_id |
---|---------------|-----------|------------------------|-----------|
1  |Tim            |9876728382 |tim@email.com           |1          |
3  |Tim            |8265748319 |toy@test.com            |1          |
5  |Timmothy       |8265748319 |timmothy@somemail.com   |1          |
7  |Tim M. Jacob   |           |timmothy@somemail.com   |1          |
2  |John           |9845323453 |john@email.com          |2          |
4  |John Snow      |9845323453 |                        |2          |
6  |John           |8345908112 |JohnySnow@someemail.com |2          |
8  |John P. Snow   |8345908112 |                        |2          |
9  |Rack           |7654783949 |racky@email.com         |9          |
10 |Racky Dsouza   |9934364837 |racky@email.com         |9          |
11 |Rock M. Dsouza |9934364837 |rackguy@somemail.com    |9          |
12 |John Snowden   |8463865392 |John@someemail.com      |12         |

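This may work for small datasets, but if your table has many records, this query is likely to be very slow, because the recursive search is run again for every row.

You can find some hints on how to improve this algorithm, or implement a better one, here: Partition refinement. That would most likely require implementing appropriate data structures (doubly linked lists or arrays, depending on the algorithm), and an SQL table is not the best choice for that.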
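As one possible way to do this outside SQL, the sketch below uses a union-find structure (the disjoint-set idea discussed earlier, not the partition refinement algorithm itself) and makes a single pass over the rows. It assumes the rows have been fetched into memory as tuples; `CUSTOMERS` and `cluster_ids` are hypothetical names, and `None` stands in for NULL:

```python
# Sample rows from the question: (id, name, mobile, email).
CUSTOMERS = [
    (1, "Tim", "9876728382", "tim@email.com"),
    (2, "John", "9845323453", "john@email.com"),
    (3, "Tim", "8265748319", "toy@test.com"),
    (4, "John Snow", "9845323453", None),
    (5, "Timmothy", "8265748319", "timmothy@somemail.com"),
    (6, "John", "8345908112", "JohnySnow@someemail.com"),
    (7, "Tim M. Jacob", None, "timmothy@somemail.com"),
    (8, "John P. Snow", "8345908112", None),
    (9, "Rack", "7654783949", "racky@email.com"),
    (10, "Racky Dsouza", "9934364837", "racky@email.com"),
    (11, "Rock M. Dsouza", "9934364837", "rackguy@somemail.com"),
    (12, "John Snowden", "8463865392", "John@someemail.com"),
]


def cluster_ids(rows):
    """Map each row id to the smallest id in its group of linked rows."""
    parent = {r[0]: r[0] for r in rows}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as the representative.
            parent[max(ra, rb)] = min(ra, rb)

    # Remember the first row seen for each (column, value) pair. The column
    # index is part of the key so a name is only compared with names, etc.
    first_seen = {}
    for row_id, *values in rows:
        for column, value in enumerate(values):
            if value is None:
                continue  # NULL never links two rows
            key = (column, value)
            if key in first_seen:
                union(row_id, first_seen[key])
            else:
                first_seen[key] = row_id

    return {r[0]: find(r[0]) for r in rows}


clusters = cluster_ids(CUSTOMERS)
```

Each row is visited once and each value costs one dictionary lookup plus a near-constant-time union, so the work grows roughly linearly with the number of rows instead of re-running a recursive search per row. Writing the resulting mapping back to the database is left out of this sketch.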
