How to consolidate two ID columns, identifying which rows belong to the same set of related IDs

Problem description

I have 2 ID columns that are created/collected independently. I'm trying to consolidate these two ID columns into one by determining which rows are part of the same related group of ids based on either of the two ID columns. I would consider the rows to be related based on a few rules:

1: If a LOAN has the same value in multiple rows, they belong to the same group (in the example for reference only.) I've called it loan_group. No issues here.

2: If a COLLATERAL has the same value in multiple rows, they belong to the temporary group. I've called it collateral_group (same rule as #1.) No issues here.

3: Finally, and I'm not sure how to phrase this exactly, but any time there is overlap between values that are part of the same group (across loan and collateral columns), those groups should be further consolidated. For example:

LOAN  COLLATERAL  loan_group  collateral_group  final_grouping
----  ----------- ----------  ----------------  --------------
L1    C1*         1           1                 **1**
L2**  C1*         2           1                 **1**
L5    C8          3           2                 2
L2**  C4***       2           3                 **1**
L6    C8          4           2                 2
L7    C9          5           4                 3
L8    C4***       6           3                 **1**

*because rows 1 and 2 both have the value C1, they would be assigned to the same final grouping

**because row 2 has the LOAN value L2, this means we can include row 4 in the consolidated final grouping. That row can be linked back to row 1 via the L2/C1 link

***finally, because row 4 includes the COLLATERAL value C4, this means we can include row 7 in the consolidated final grouping. That row can be linked back to row 1 via the L2/C4 and L2/C1 links

The data set is roughly 15 million unique combinations of LOAN + COLLATERAL. In some edge cases a group will likely span a few thousand (maybe 10 thousand or more) IDs. I've run into some resource issues on BQ while testing some solutions (but those issues are mostly to do with my inexperience with BQ). Just a heads up in case that affects anybody's recommendation.

Really appreciate your time, apologies for being overly vague/brief in my first version...

Solution

Below is for BigQuery Standard SQL.

As Gordon mentioned in the comments, BigQuery had no native support for recursive CTEs or hierarchical queries at the time of this answer, so this could not be done with just a single query.

However, it can be implemented with the recently introduced scripting feature, as in the example below:

DECLARE run_away_stop INT64 DEFAULT 0;

-- Sample data from the question.
CREATE TEMP TABLE input AS (
  SELECT 'L1' loan, 'C1' collateral UNION ALL
  SELECT 'L2', 'C1' UNION ALL
  SELECT 'L5', 'C8' UNION ALL
  SELECT 'L2', 'C4' UNION ALL
  SELECT 'L6', 'C8' UNION ALL
  SELECT 'L7', 'C9' UNION ALL
  SELECT 'L8', 'C4'
);

-- Starting groups: one sorted array of collateral values per loan.
CREATE TEMP TABLE initial_grouping AS 
SELECT ARRAY_AGG(collateral ORDER BY collateral) arr 
FROM input
GROUP BY loan;

LOOP
  SET run_away_stop = run_away_stop + 1;

  -- One merge pass: union each array with every array it shares a value with,
  -- then collapse identical arrays into a single row.
  CREATE OR REPLACE TEMP TABLE initial_grouping AS
  SELECT ANY_VALUE(arr) arr FROM (
    SELECT ARRAY(SELECT DISTINCT val FROM UNNEST(arr) val ORDER BY val) arr
    FROM (
      SELECT ANY_VALUE(arr1) arr1, ARRAY_CONCAT_AGG(arr) arr    
      FROM (
        SELECT t1.arr arr1, t2.arr arr2,
          ARRAY(SELECT DISTINCT val FROM UNNEST(ARRAY_CONCAT(t1.arr, t2.arr)) val ORDER BY val) arr 
        FROM initial_grouping t1, initial_grouping t2 
        WHERE (SELECT COUNT(1) FROM UNNEST(t1.arr) val JOIN UNNEST(t2.arr) val USING(val)) > 0
      ) GROUP BY FORMAT('%t', arr1)
    )
  ) GROUP BY FORMAT('%t', arr);

  -- Stop when the safety cap is reached, or when no collateral value appears in
  -- more than one array. Comparing only the row count before and after a pass is
  -- not enough: on long chains the arrays can keep growing while their number
  -- stays the same.
  IF run_away_stop > 10
     OR (SELECT COUNT(1) = COUNT(DISTINCT val) FROM initial_grouping, UNNEST(arr) val) THEN
    BREAK;
  END IF;
END LOOP;

SELECT loan, collateral, final_grouping FROM input 
JOIN (SELECT ROW_NUMBER() OVER() final_grouping, arr FROM initial_grouping) 
ON collateral IN UNNEST(arr) 
ORDER BY loan, collateral; 

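The script above produces the result below (which I believe is exactly what you are looking for). The final_grouping numbers are arbitrary labels, because ROW_NUMBER() OVER() has no ORDER BY; what matters is which rows share a label.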

Row loan    collateral  final_grouping   
1   L1      C1          1    
2   L2      C1          1    
3   L2      C4          1    
4   L5      C8          3    
5   L6      C8          3    
6   L7      C9          2    
7   L8      C4          1    
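
To cross-check the grouping outside BigQuery, the rules in the question can be read as finding connected components in a graph whose nodes are loan and collateral values and whose edges are the rows. The following is a minimal Python sketch of that reading using a union-find structure. It is not part of the original answer; the function name final_groupings and the numbering by first appearance are assumptions made for illustration.

```python
def final_groupings(rows):
    """Label each (loan, collateral) row with a connected-component number."""
    parent = {}

    def find(x):
        # Return the representative of x's component (with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Every row links its loan to its collateral. Tagging the values keeps a
    # loan ID and a collateral ID with the same text apart.
    for loan, collateral in rows:
        union(("loan", loan), ("collateral", collateral))

    # Number the components in order of first appearance (hypothetical choice).
    labels = {}
    result = []
    for loan, collateral in rows:
        root = find(("loan", loan))
        group = labels.setdefault(root, len(labels) + 1)
        result.append((loan, collateral, group))
    return result


rows = [("L1", "C1"), ("L2", "C1"), ("L5", "C8"), ("L2", "C4"),
        ("L6", "C8"), ("L7", "C9"), ("L8", "C4")]
for row in final_groupings(rows):
    print(*row)
```

On the sample rows this assigns L1/C1, L2/C1, L2/C4 and L8/C4 to one group, L5/C8 and L6/C8 to a second, and L7/C9 to a third, which is the same partition as the script's result and the final_grouping column in the question.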

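Please note: when applying this to real data, make sure you set an appropriate maximum for run_away_stop (in the script above it is 10; see the last statement within the LOOP). You might need to increase it so that the loop runs until the groups have converged. If the cap is hit first, some collateral values remain in more than one array and the final join returns duplicate rows.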
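
To get a feel for how many passes a given shape of data needs, the merge step can be simulated offline. The sketch below is an assumption-laden Python model of the loop, not the BigQuery script itself: merge_pass mirrors one CREATE OR REPLACE step (union each group with every group it overlaps, then deduplicate), and the simulation stops once the groups are pairwise disjoint. The chain data is made up as a worst-case shape in which each loan shares one collateral with the next.

```python
def merge_pass(groups):
    """One pass: union each group with all groups it overlaps, then dedupe."""
    merged = set()
    for g1 in groups:
        acc = set()
        for g2 in groups:
            if not g1.isdisjoint(g2):
                acc |= g2
        merged.add(frozenset(acc))
    return merged


def is_disjoint(groups):
    """True when no value appears in more than one group."""
    total = sum(len(g) for g in groups)
    return total == len(set().union(*groups))


def passes_needed(groups):
    """Run merge passes until the groups are pairwise disjoint."""
    groups = {frozenset(g) for g in groups}
    passes = 0
    while True:
        groups = merge_pass(groups)
        passes += 1
        if is_disjoint(groups):
            return passes, groups


# Hypothetical chain of 200 loans: loan i holds collateral C{i} and C{i+1}.
chain = [{f"C{i}", f"C{i + 1}"} for i in range(1, 201)]
passes, groups = passes_needed(chain)
print("passes:", passes, "groups:", len(groups))
```

In this model the range covered by each group roughly triples per pass, so the pass count grows slowly with chain length, but it still grows. Running the simulation on a sample with the longest chains you expect is one way to choose the cap.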

Finally, to apply this to your real table:

1 - remove the CREATE TEMP TABLE input (...) statement
2 - replace input with your_project.your_dataset.your_table in the CREATE TEMP TABLE initial_grouping AS ... statement and in the final SELECT, which also reads from input
