Search for cross-field duplicates in PostgreSQL and bring back matched pairs
Question
I have a table of contacts. The table contains a mobile_phone column as well as a home_phone column. I'd like to fetch all pairs of duplicate contacts where a pair is two contacts sharing a phone number.
Note that if contact A's mobile_phone matches contact B's home_phone, this is also a duplicate. Here is an example of three contacts that should match.
contact_id|mobile_phone|home_phone|other columns such as email.......|...
-------------------------------------------------------------------------
111 |9748777777 |1112312312|..................................|...
112 |1112312312 |null |..................................|...
113 |9748777777 |0001112222|..................................|...
Specifically, I would like to bring back a table where each row contains the contact_ids of the two matching contacts. For example,
||contact_id_a|contact_id_b||
||-------------------------||
|| 145155 | 145999 ||
|| 145158 | 145141 ||
With the help of @Erwin, I was able to write a query close to what I am trying to achieve. It brings back a list of contact_ids of all contacts in the list that share a phone number with other contacts in the list.
SELECT c.contact_id
FROM   contacts c
WHERE  EXISTS (
   SELECT FROM contacts x
   WHERE (   x.data->>'mobile_phone' IS NOT NULL AND x.data->>'mobile_phone' IN (c.data->>'mobile_phone', c.data->>'home_phone')
          OR x.data->>'home_phone' IS NOT NULL AND x.data->>'home_phone' IN (c.data->>'mobile_phone', c.data->>'home_phone'))
   AND   x.contact_id <> c.contact_id  -- except self; the parentheses matter because AND binds tighter than OR
   );
The output only contains contact_ids like this...
||contact_id||
--------------
|| 2341514 ||
|| 345141 ||
I'd like to bring back the contact_ids of matching contacts in a single row as shown above.
Recommended answer
A simple query would be with the ARRAY overlap operator &&:
SELECT c1.contact_id AS a, c2.contact_id AS b
FROM contacts c1
JOIN contacts c2 ON c1.contact_id < c2.contact_id
WHERE ARRAY [c1.mobile_phone, c1.home_phone] && ARRAY[c2.mobile_phone, c2.home_phone];
The condition c1.contact_id < c2.contact_id excludes self-joins and switched duplicates (the same pair in reverse order).
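To make the result concrete, the query can be tried against the three sample contacts from the question. This is only a minimal sketch: the CTE named contacts stands in for the real table, and it assumes the phone numbers are stored as text in plain columns (not inside a JSON document as in the question's query).

```sql
-- Stand-in for the real table, filled with the sample rows from the question.
WITH contacts (contact_id, mobile_phone, home_phone) AS (
   VALUES
     (111, '9748777777', '1112312312')
   , (112, '1112312312', NULL)
   , (113, '9748777777', '0001112222')
   )
SELECT c1.contact_id AS contact_id_a, c2.contact_id AS contact_id_b
FROM   contacts c1
JOIN   contacts c2 ON c1.contact_id < c2.contact_id   -- each pair once, never self
WHERE  ARRAY[c1.mobile_phone, c1.home_phone]
    && ARRAY[c2.mobile_phone, c2.home_phone]           -- any shared number
ORDER  BY 1, 2;

-- Returns two rows: (111, 112) and (111, 113).
```

Contact 112 has a NULL home_phone, but the overlap operator never treats NULL elements as matching, so missing numbers do not produce false pairs.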
But this representation gets out of hand quickly if many contacts share the same number in some way: n contacts sharing one number produce n(n-1)/2 rows.
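If pairs become unwieldy, one possible alternative (not part of the original answer) is to return one row per shared number with all contacts that use it. The sketch below assumes the same plain text columns; the CTE with sample rows and the output column names phone and contact_ids are illustrative only.

```sql
-- Stand-in for the real table, filled with the sample rows from the question.
WITH contacts (contact_id, mobile_phone, home_phone) AS (
   VALUES
     (111, '9748777777', '1112312312')
   , (112, '1112312312', NULL)
   , (113, '9748777777', '0001112222')
   )
SELECT p.phone
     , array_agg(DISTINCT c.contact_id ORDER BY c.contact_id) AS contact_ids
FROM   contacts c
CROSS  JOIN LATERAL (VALUES (c.mobile_phone), (c.home_phone)) AS p(phone)  -- one row per number
WHERE  p.phone IS NOT NULL                  -- ignore missing numbers
GROUP  BY p.phone
HAVING count(DISTINCT c.contact_id) > 1     -- keep only numbers used by several contacts
ORDER  BY p.phone;
```

This yields one row per duplicated number instead of one row per pair, so the result grows linearly with the number of contacts sharing a number.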
Aside: conditions in an [INNER] JOIN clause and conditions in the WHERE clause boil down to exactly the same thing as long as no more than join_collapse_limit joins are involved. See:
- Count on join of big tables with conditions is slow
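As a small illustration of the aside, the overlap condition can be moved from WHERE into the join condition without changing the result of this inner join. The CTE with sample rows is again only a stand-in for the real table.

```sql
-- Current setting; the PostgreSQL default is 8.
SHOW join_collapse_limit;

-- Same pairs as the query in the answer, with every condition in the ON clause.
WITH contacts (contact_id, mobile_phone, home_phone) AS (
   VALUES
     (111, '9748777777', '1112312312')
   , (112, '1112312312', NULL)
   , (113, '9748777777', '0001112222')
   )
SELECT c1.contact_id AS a, c2.contact_id AS b
FROM   contacts c1
JOIN   contacts c2 ON c1.contact_id < c2.contact_id
                  AND ARRAY[c1.mobile_phone, c1.home_phone]
                   && ARRAY[c2.mobile_phone, c2.home_phone];
```

Which form to use is mostly a matter of readability here; this equivalence does not carry over to outer joins, where moving a condition between ON and WHERE can change the result.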