连接的组件 [英] Connected Components

查看:92
本文介绍了连接的组件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组数据,这些数据是通过将相似的子项目匹配在一起,然后按类别"将这些相似的项目分组来创建的.

现在,必须以这样的方式匹配结果类别,即在每个"group_id"内将相关类别分组在一起.在下面的示例中,一个匹配项是A-> B-> C-> D-> E-> F-> G,这是通过逐行重复获得的.

我已经发布了当前答案,该答案适用于此简单数据集,但由于实际数据集包含最多1M行,每个"group_id"最多可能有60个类别,此查询会导致实际数据出现超出假脱机空间"错误.

请注意:

  • 由于公司的限制,我不能使用存储过程.
  • 我不能使用用户定义函数(UDF)
  • 我不能使用用户定义类型(UDT)

正确答案将会

  • 为Teradata或兼容版本编写
  • 比我的答案更有效率
  • 尊重我上面提到的限制

样本输入:

所需的输出:

解决方案

您需要递归方法,但是您的WITH RECURSIVE创建了一个巨大的中间结果,导致没有更多的假脱机.

对于类似的过程,我使用以下方法(最初在存储过程中使用WHILE循环):

CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log  AS
 (
  SELECT group_id, category_1, category_2, 
     -- assign a unique number to 
     Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk

  -- remove when you source data is unique
  GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
  FROM match_detail 
 )
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;

现在重复以下更新,直到0 rows processed:

-- find matching categories and assign them a common number    
UPDATE vt_tmp FROM
 ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
   FROM vt_tmp e1 JOIN vt_tmp e2
   ON e1.category_2 = e2.category_2
   AND e1.rnk < e2.rnk
   GROUP BY e2.group_id, e2.category_1
 ) x
SET rnk = minrnk
WHERE 
  vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;

要获得相关类别,您最终需要:

SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk 
FROM vt_tmp

要与您的预期结果完全匹配,您需要添加DENSE_RANK:

SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
 (
   SELECT group_id, category_1 AS category, rnk AS related_categories
   FROM vt_tmp
   UNION
   SELECT group_id, category_2, rnk 
   FROM vt_tmp
 ) AS dt

I have a set of data that has been created by matching together similar sub-items, and then GROUPing these similar items by "category".

Now, the resultant categories must be matched in such a way that groups related categories together within each "group_id". In the example below, one match is A->B->C->D->E->F->G, which is obtained by recursing through rows.

I've posted my current answer, which works on this simple data set, but because the actual data set contains up to 1M rows, and there may be up to 60 categories per "group_id," this query causes an "out of spool space" error on real data.

Please note:

  • Due to company restrictions, I cannot use stored procedures.
  • I can't use user defined functions (UDFs)
  • I can't use user defined types (UDTs)

A correct answer will

  • Be written for Teradata or compatible
  • Be more efficient than my answer
  • Respect the restrictions I mention above

Sample Input:

Desired Output:

解决方案

You need a recursive apporach, but your WITH RECURSIVE creates a humongous intermediate result, which leads to no more spool.

For a similar process I used following approach (originally USING A WHILE-loop in a Stored Procedure):

CREATE MULTISET VOLATILE TABLE vt_tmp, NO Log  AS
 (
  SELECT group_id, category_1, category_2, 
     -- assign a unique number to 
     Dense_Rank() Over (ORDER BY group_id, category_1) AS rnk

  -- remove when you source data is unique
  GROUP BY 1,2,3 -- same result as a DISTINCT, but processed before DENSE_RANK
  FROM match_detail 
 )
WITH DATA
PRIMARY INDEX (category_2)
ON COMMIT PRESERVE ROWS;

Now repeat the following update until 0 rows processed:

-- find matching categories and assign them a common number    
UPDATE vt_tmp FROM
 ( SELECT e2.group_id, e2.category_1, Min(e1.rnk) AS minrnk
   FROM vt_tmp e1 JOIN vt_tmp e2
   ON e1.category_2 = e2.category_2
   AND e1.rnk < e2.rnk
   GROUP BY e2.group_id, e2.category_1
 ) x
SET rnk = minrnk
WHERE 
  vt_tmp.group_id = x.group_id
AND vt_tmp.category_1 = x.category_1
;

To get the related categories you finally need:

SELECT group_id, category_1 AS category, rnk AS related_categories
FROM vt_tmp
UNION
SELECT group_id, category_2, rnk 
FROM vt_tmp

For an exact match of your expected result you need to add a DENSE_RANK:

SELECT group_id, category, Dense_Rank() Over (PARTITION BY group_id ORDER BY related_categories)
FROM
 (
   SELECT group_id, category_1 AS category, rnk AS related_categories
   FROM vt_tmp
   UNION
   SELECT group_id, category_2, rnk 
   FROM vt_tmp
 ) AS dt

这篇关于连接的组件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆