从链接的值中查找唯一的用户 [英] Finding unique users from linked values

查看:57
本文介绍了从链接的值中查找唯一的用户的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的表格中有此值.

id | val1  | val2 
--------------------
1  |   e1  |   m1
2  |   e1  |   m2
3  |   e2  |   m2
4  |   e3  |   m1
5  |   e4  |   m3
6  |   e5  |   m3
7  |   e5  |   m4
8  |   e4  |   m5

由此,我必须恢复像这样的唯一用户,并为他们提供一个唯一的ID进行标识.

User1->(val1:e1,e2,e3 | val2:m1,m2)

e1<-> m1,e1<-> m2,m1<-> e3,e2<-> m2(<->表示链接).

e1连接到m1.

e1连接到m2.

m2已连接到e2.

所以e1,m1连接到e2.

类似地,我们发现e1,e2,e3,m1,m2都链接在一起.我们需要识别这些链.


User2->(val1:e4,e5 | val2:m3,m4,m5)

我已经基于对val1进行分组,然后分别对val2进行分组并将它们连接到代码(Java)中,从而编写了两个查询.

我希望它直接在MySQL/BigQuery查询本身中执行此操作,因为我们正在为此构建一些报告.

在单个查询中有可能吗?请帮忙.

谢谢.

更新:

所需的输出-

[
 { 
   id : user1,
   val1 : [e1, e2, e3],
   val2 : [m1, m2]
 },
 { 
   id : user2,
   val1 : [e4, e5],
   val2 : [m3, m4, m5]
 }
]

id | val1  | val2 | UUID
------------------------
1  |   e1  |   m1 | u1
2  |   e1  |   m2 | u1
3  |   e2  |   m2 | u1
4  |   e3  |   m1 | u1
5  |   e4  |   m3 | u2
6  |   e5  |   m3 | u2
7  |   e5  |   m4 | u2
8  |   e4  |   m5 | u2

为简单起见,假设val1和val2的值是节点,并且如果存在于同一行中则进行连接.

表中的行形成图形(user1,user2),我们需要标识这些图形.

解决方案

希望通过纯BigQuery(Standard SQL)解决方案来跳入

先决条件/假设:源数据在sandbox.temp.id1_id2_pairs
中 您应该用自己的替换它,或者如果您想用问题中的伪数据进行测试-您可以按如下方式创建此表(当然用您自己的project.dataset替换sandbox.temp)


确保设置了相应的目标表

注意:您可以在此答案的底部找到所有相应的查询(以文本形式),但现在我将用屏幕截图说明我的答案-所有内容均已显示-查询,结果和使用的选项

因此,将分三个步骤:

步骤1-初始化

在这里,我们只是基于与id2的连接对id1进行初始分组:

如您在此处看到的-我们基于通过id2的简单一级连接创建了具有相应连接的所有id1值的列表

输出表为sandbox.temp.groups

第2步-分组迭代

在每次迭代中,我们将基于已建立的组来丰富分组.
查询源是上一步(sandbox.temp.groups)的输出表,目标是具有覆盖

的同一表(sandbox.temp.groups)

我们将继续迭代,直到找到的组数与上一次迭代相同为止

注意:您可以打开两个BigQuery Web UI标签(如上图所示),而无需更改任何代码,只需运行分组,然后一次又一次地检查直到迭代收敛

(对于我在先决条件部分中使用的特定数据-我进行了3次迭代-第一次迭代产生了5个用户,第二次迭代产生了3个用户,第三次迭代又产生了3个用户-这表明我们已经完成了迭代. >

当然,在现实生活中-迭代次数可能不止3次-因此我们需要某种自动化方式(请参见答案底部的相​​应部分).

第3步– 最终分组
id1分组完成后-我们可以为id2添加最终分组

现在的最终结果在sandbox.temp.users

使用的查询(不要忘记根据上述逻辑和屏幕截图设置相应的目标表并在需要时进行覆盖):

先决条件:

 #standardSQL
SELECT 1 id, 'e1' id1, 'm1' id2 UNION ALL
SELECT 2,    'e1',     'm2' UNION ALL
SELECT 3,    'e2',     'm2' UNION ALL
SELECT 4,    'e3',     'm1' UNION ALL
SELECT 5,    'e4',     'm3' UNION ALL
SELECT 6,    'e5',     'm3' UNION ALL
SELECT 7,    'e5',     'm4' UNION ALL
SELECT 8,    'e4',     'm5' UNION ALL
SELECT 9,    'e6',     'm6' UNION ALL
SELECT 9,    'e7',     'm7' UNION ALL
SELECT 9,    'e2',     'm6' UNION ALL
SELECT 888,  'e4',     'm55'   
 

第1步

 #standardSQL
WITH `yourTable` AS (select * from `sandbox.temp.id1_id2_pairs`
), x1 AS (SELECT id1, STRING_AGG(id2) id2s FROM `yourTable` GROUP BY id1
), x2 AS (SELECT id2, STRING_AGG(id1) id1s FROM `yourTable` GROUP BY id2 
), x3 AS (
  SELECT id, (SELECT STRING_AGG(i ORDER BY i) FROM (
    SELECT DISTINCT i FROM UNNEST(SPLIT(id1s)) i)) grp
  FROM (
    SELECT x1.id1 id, STRING_AGG((id1s)) id1s FROM x1 CROSS JOIN x2
    WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(id1s)) y WHERE x1.id1 = y)
    GROUP BY id1) 
)
SELECT * FROM x3 
 

第2步-分组

 #standardSQL
WITH x3 AS (select * from `sandbox.temp.groups`)
SELECT id, (SELECT STRING_AGG(i ORDER BY i) FROM (
  SELECT DISTINCT i FROM UNNEST(SPLIT(grp)) i)) grp
FROM (
  SELECT a.id, STRING_AGG(b.grp) grp FROM x3 a CROSS JOIN x3 b 
  WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(b.grp)) y WHERE a.id = y)
  GROUP BY a.id )   
 

第2步-检查

 #standardSQL
SELECT COUNT(DISTINCT grp) users FROM `sandbox.temp.groups` 
 

第3步

 #standardSQL
WITH `yourTable` AS (select * from `sandbox.temp.id1_id2_pairs`
), x1 AS (SELECT id1, STRING_AGG(id2) id2s FROM `yourTable` GROUP BY id1 
), x3 as (select * from `sandbox.temp.groups`
), f  AS (SELECT DISTINCT grp FROM x3 ORDER BY grp
)
SELECT ROW_NUMBER() OVER() id, grp id1, 
  (SELECT STRING_AGG(i ORDER BY i) FROM (SELECT DISTINCT i FROM UNNEST(SPLIT(id2)) i)) id2
FROM (
  SELECT grp, STRING_AGG(id2s) id2 FROM f 
  CROSS JOIN x1 WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(f.grp)) y WHERE id1 = y)
  GROUP BY grp)
 

自动化:
当然,如果迭代快速收敛,则可以手动执行上述过程",因此最终将运行10-20次.但是在现实生活中,您可以使用以下任何 client 轻松地自动执行此操作您的选择

I have values in my Table of this form.

id | val1  | val2 
--------------------
1  |   e1  |   m1
2  |   e1  |   m2
3  |   e2  |   m2
4  |   e3  |   m1
5  |   e4  |   m3
6  |   e5  |   m3
7  |   e5  |   m4
8  |   e4  |   m5

From this, I have to recover unique users like this and give them a unique id to identify.

User1 -> (val1 : e1, e2, e3 | val2: m1, m2)

e1 <-> m1, e1 <-> m2, m1 <-> e3, e2 <-> m2 ( <-> means linked).

e1 is connected to m1.

e1 is connected to m2.

m2 is connected to e2.

So e1,m1 are connected to e2.

Similarly, we find e1, e2, e3, m1, m2 all are linked. We need to identify these chains.


User2 -> (val1 : e4, e5 | val2: m3, m4, m5)

I have written two queries based on grouping my val1 and then by val2 separately and joining them in code (Java).

I want this to do this directly in MySQL/BigQuery query itself as we are building some reports on this.

Is this possible in a single query? Please help.

Thank you.

Update :

Desired output -

[
 { 
   id : user1,
   val1 : [e1, e2, e3],
   val2 : [m1, m2]
 },
 { 
   id : user2,
   val1 : [e4, e5],
   val2 : [m3, m4, m5]
 }
]

or

id | val1  | val2 | UUID
------------------------
1  |   e1  |   m1 | u1
2  |   e1  |   m2 | u1
3  |   e2  |   m2 | u1
4  |   e3  |   m1 | u1
5  |   e4  |   m3 | u2
6  |   e5  |   m3 | u2
7  |   e5  |   m4 | u2
8  |   e4  |   m5 | u2

To make it simple, assuming values of val1 and val2 are nodes and are connected if present in the same row.

The rows of the table form graphs (user1, user2) and we need to identify these graphs.

解决方案

Wanted to jump-in with option of solving your task with pure BigQuery (Standard SQL)

Pre-requisites / assumptions: source data is in sandbox.temp.id1_id2_pairs
You should replace this with your own or if you want to test with dummy data from your question - you can create this table as below (of course replace sandbox.temp with your own project.dataset)


Make sure you set respective destination table

Note: you can find all respective Queries (as text) at the bottom of this answer, but for now I am illustrating my answer with screenshots - so all is presented - query, result and used options

So, there will be three steps:

Step 1 - Initialization

Here, we just do initial grouping of id1 based on connections with id2:

As you can see here - we created list of all id1 values with respective connections based on simple one-level connection through id2

Output table is sandbox.temp.groups

Step 2 - Grouping Iterations

In each iteration we will enrich grouping based on already established groups.
Source of Query is output table of previous Step (sandbox.temp.groups) and Destination is the same table (sandbox.temp.groups) with Overwrite

We will continue iterations till when count of found groups will be the same as in previous iteration

Note: you can just have two BigQuery Web UI Tabs opened (as it is shown above) and without changing any code just run Grouping and then Check again and again till iteration converge

(for specific data that I used in pre-requisites section - I had three iterations - first iteration produced 5 users, second iteration produced 3 users and third iteration produced again 3 users - which indicated that we done with iterations.

Of course, in real life case - number of iterations could be more than just three - so we need some sort of automation (see respective section at the bottom of answer).

Step 3 – Final Grouping
When id1 grouping is completed - we can add final grouping for id2

Final result now is in sandbox.temp.users table

Used Queries (do not forget to set respective destination tables and overwrites when needed as per above described logic and screenshots):

Pre-requisites:

#standardSQL
SELECT 1 id, 'e1' id1, 'm1' id2 UNION ALL
SELECT 2,    'e1',     'm2' UNION ALL
SELECT 3,    'e2',     'm2' UNION ALL
SELECT 4,    'e3',     'm1' UNION ALL
SELECT 5,    'e4',     'm3' UNION ALL
SELECT 6,    'e5',     'm3' UNION ALL
SELECT 7,    'e5',     'm4' UNION ALL
SELECT 8,    'e4',     'm5' UNION ALL
SELECT 9,    'e6',     'm6' UNION ALL
SELECT 9,    'e7',     'm7' UNION ALL
SELECT 9,    'e2',     'm6' UNION ALL
SELECT 888,  'e4',     'm55'   

Step 1

#standardSQL
WITH `yourTable` AS (select * from `sandbox.temp.id1_id2_pairs`
), x1 AS (SELECT id1, STRING_AGG(id2) id2s FROM `yourTable` GROUP BY id1
), x2 AS (SELECT id2, STRING_AGG(id1) id1s FROM `yourTable` GROUP BY id2 
), x3 AS (
  SELECT id, (SELECT STRING_AGG(i ORDER BY i) FROM (
    SELECT DISTINCT i FROM UNNEST(SPLIT(id1s)) i)) grp
  FROM (
    SELECT x1.id1 id, STRING_AGG((id1s)) id1s FROM x1 CROSS JOIN x2
    WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(id1s)) y WHERE x1.id1 = y)
    GROUP BY id1) 
)
SELECT * FROM x3 

Step 2 - Grouping

#standardSQL
WITH x3 AS (select * from `sandbox.temp.groups`)
SELECT id, (SELECT STRING_AGG(i ORDER BY i) FROM (
  SELECT DISTINCT i FROM UNNEST(SPLIT(grp)) i)) grp
FROM (
  SELECT a.id, STRING_AGG(b.grp) grp FROM x3 a CROSS JOIN x3 b 
  WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(b.grp)) y WHERE a.id = y)
  GROUP BY a.id )   

Step 2 - Check

#standardSQL
SELECT COUNT(DISTINCT grp) users FROM `sandbox.temp.groups` 

Step 3

#standardSQL
WITH `yourTable` AS (select * from `sandbox.temp.id1_id2_pairs`
), x1 AS (SELECT id1, STRING_AGG(id2) id2s FROM `yourTable` GROUP BY id1 
), x3 as (select * from `sandbox.temp.groups`
), f  AS (SELECT DISTINCT grp FROM x3 ORDER BY grp
)
SELECT ROW_NUMBER() OVER() id, grp id1, 
  (SELECT STRING_AGG(i ORDER BY i) FROM (SELECT DISTINCT i FROM UNNEST(SPLIT(id2)) i)) id2
FROM (
  SELECT grp, STRING_AGG(id2s) id2 FROM f 
  CROSS JOIN x1 WHERE EXISTS (SELECT y FROM UNNEST(SPLIT(f.grp)) y WHERE id1 = y)
  GROUP BY grp)

Automation:
Of course, above "process" can be executed manually in case if iterations converge fast - so you will end up with 10-20 runs. But in more real-life cases you can easily automate this with any client of your choice

这篇关于从链接的值中查找唯一的用户的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆