如何合并两个ID列,确定哪些行属于同一组相关ID [英] How to consolidate two id columns, identifying which rows belong to same set of related IDs

查看:90
本文介绍了如何合并两个ID列,确定哪些行属于同一组相关ID的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有2个独立创建/收集的ID列.我试图通过基于两个ID列中的任何一个来确定哪些行是同一相关ID组的一部分,从而将这两个ID列合并为一个.我会根据一些规则将行关联起来:

I have 2 ID columns that are created/collected independently. I'm trying to consolidate these two ID columns into one by determining which rows are part of the same related group of ids based on either of the two ID columns. I would consider the rows to be related based on a few rules:

1:如果LOAN在多行中具有相同的值,则它们属于同一组(在示例中仅供参考.)我将其称为loan_group.这里没有问题.

1: If a LOAN has the same value in multiple rows, they belong to the same group (in the example for reference only.) I've called it loan_group. No issues here.

2:如果COLLATERAL在多行中具有相同的值,则它们属于临时组.我称它为collat​​eral_group(与#1规则相同.)在这里没有问题.

2: If a COLLATERAL has the same value in multiple rows, they belong to the temporary group. I've called it collateral_group (same rule as #1.) No issues here.

3:最后,我不确定如何准确地表述这一点,但是只要属于同一组的值(在贷款和抵押列中)之间存在重叠,则应进一步合并这些组.例如:

3: Finally, and I'm not sure how to phrase this exactly, but any time there is overlap between values that are part of the same group (across loan and collateral columns), those groups should be further consolidated. For example:

LOAN  COLLATERAL  loan_group  collateral_group  final_grouping
----  ----------- ----------  ----------------  --------------
L1    C1*         1           1                 **1**
L2**  C1*         2           1                 **1**
L5    C8          3           2                 2
L2**  C4***       2           3                 **1**
L6    C8          4           2                 2
L7    C9          5           4                 3
L8    C4***       6           3                 **1**

**因为第1行和第2行的值均为C1,所以它们将被分配给相同的最终分组

*because rows 1 and 2 both have the value C1, they would be assigned to the same final grouping

**因为第2行的LOAN值为L2,这意味着我们可以将第4行包括在合并的最终分组中.该行可以通过L2/C1链接链接回到第1行

**because row 2 has the LOAN value L2, this means we can include row 4 in the consolidated final grouping. That row can be linked back to row 1 via the L2/C1 link

***最后,由于第4行包含COLLATERAL值C4,因此这意味着我们可以在合并的最终分组中包含第7行.该行可以通过L2/C4& A链接回第一行. L2/C1链接

***finally, because row 4 includes the COLLATERAL value C4, this means we can include row 7 in the consolidated final grouping. That row can be linked back to row one via the L2/C4 & L2/C1 links

数据集大约是贷款+抵押的15m唯一组合.在某些情况下,这些小组可能会交叉使用几千个(也许是一万个)ID.我在BQ测试某些解决方案时遇到了一些资源问题(但是这些问题主要是由于我对BQ的缺乏经验所致.)请注意,如果这会影响任何人的建议.

The data set is roughly 15m unique combinations of LOAN + COLLATERAL. The groups will likely crossover a few thousand (maybe +10 thousand) IDs in some edge cases. I've run into some resource issues on BQ testing some solutions (but those issues are mostly a do with my inexperience with BQ.) Just a heads up if that impacts anybody's recommendation.

非常感谢您的时间,在我的第一个版本中过分含糊/简短地道歉...

Really appreciate your time, apologies for being overly vague/brief in my first version...

推荐答案

以下是BigQuery标准SQL

Below is for BigQuery Standard SQL

正如戈登在评论中所提到的-BigQuery不对递归CTE或层次结构查询提供本机支持,因此仅凭单个查询就无法做到这一点!

As Gordon mentioned in comments - BigQuery doesn't have native support for recursive CTEs or hierarchical queries, so this cannot be done with just a single query!

但是...,可以使用最近引入的脚本来实现如下面的示例

BUT ..., this can be implemented using recently introduced scripting as in example below

DECLARE rows_count, run_away_stop INT64 DEFAULT 0;

CREATE TEMP TABLE input AS (
  SELECT 'L1' loan, 'C1' collateral UNION ALL
  SELECT 'L2', 'C1' UNION ALL
  SELECT 'L5', 'C8' UNION ALL
  SELECT 'L2', 'C4' UNION ALL
  SELECT 'L6', 'C8' UNION ALL
  SELECT 'L7', 'C9' UNION ALL
  SELECT 'L8', 'C4'
);

CREATE TEMP TABLE initial_grouping AS 
SELECT ARRAY_AGG(collateral ORDER BY collateral) arr 
FROM input
GROUP BY loan;

LOOP
  SET rows_count = (SELECT COUNT(1) FROM initial_grouping);
  SET run_away_stop = run_away_stop + 1;

  CREATE OR REPLACE TEMP TABLE initial_grouping AS
  SELECT ANY_VALUE(arr) arr FROM (
    SELECT ARRAY(SELECT DISTINCT val FROM UNNEST(arr) val ORDER BY val) arr
    FROM (
      SELECT ANY_VALUE(arr1) arr1, ARRAY_CONCAT_AGG(arr) arr    
      FROM (
        SELECT t1.arr arr1, t2.arr arr2, ARRAY(SELECT DISTINCT val FROM UNNEST(ARRAY_CONCAT( t1.arr, t2.arr)) val ORDER BY val) arr 
        FROM initial_grouping t1, initial_grouping t2 
        WHERE (SELECT COUNT(1) FROM UNNEST(t1.arr) val JOIN UNNEST(t2.arr) val USING(val)) > 0
      ) GROUP BY FORMAT('%t', arr1)
    )
  ) GROUP BY FORMAT('%t', arr);

  IF (rows_count = (SELECT COUNT(1) FROM initial_grouping) AND run_away_stop > 1) OR run_away_stop > 10 THEN BREAK; END IF;
END LOOP;

SELECT loan, collateral, final_grouping FROM input 
JOIN (SELECT ROW_NUMBER() OVER() final_grouping, arr FROM initial_grouping) 
ON collateral IN UNNEST(arr) 
ORDER BY loan, collateral; 

上面的脚本会产生以下结果(我相信这正是您要寻找的结果)

Above script produces below result (which I believe is exactly what you are looking for)

Row loan    collateral  final_grouping   
1   L1      C1          1    
2   L2      C1          1    
3   L2      C4          1    
4   L5      C8          3    
5   L6      C8          3    
6   L7      C9          2    
7   L8      C4          1    

请注意:应用于真实数据时-确保为run_away_stop设置适当的最大值(在上面的脚本中为10-请参阅LOOP中的最后一条语句-您可能需要增加它以确保转换将完成)

Please note: when applying to real data - make sure you set appropriate max for run_away_stop (in above script it is 10 - see last statement within LOOP - you might need to increase it to make sure conversion will complete)

最后:应用于您的真实表:

Finally: to apply to your real table:

1-删除CREATE TEMP TABLE input (...)语句
2-在CREATE TEMP TABLE initial_grouping AS ...语句

1 - remove CREATE TEMP TABLE input (...) statement
2 - replace input with your_project.your_dataset.your_table in CREATE TEMP TABLE initial_grouping AS ... statement

这篇关于如何合并两个ID列,确定哪些行属于同一组相关ID的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆