识别给定边的连接图 [英] identifying connected graphs given edges

查看:77
本文介绍了识别给定边的连接图的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我如何将甚至是间接相关的人归为一组?具体来说,使用如下所示的数据集的前两列,我如何在SAS中(可能使用DATA步骤或PROC SQL)以编程方式导出第三列?有非迭代算法吗?

How do I group people who are related, even indirectly? Concretely, using the first two columns of the data set like below, how do I in SAS (maybe using a DATA step or PROC SQL) programmatically derive the third column? Is there a non-iterative algorithm?

背景:每个人都有多个地址.通过每个地址,每个人都连接到零个或多个人.如果连接了两个人,则他们将获得相同的组ID.如果人A直接连接到B,而人B连接到C,则人A,B和C共享一个组.

Background: Each person has multiple addresses. Through each address, each person is connected to zero or more persons. If two people are connected, they get the same group ID. If person A is directly connected to B and B is connected to C, then persons A, B, and C share a group.

data people; input person_id address_id $ household_id; datalines; 1 A 1 2 B 2 3 B 2 4 C 3 5 C 3 5 D 3 6 D 3 ;

data people; input person_id address_id $ household_id; datalines; 1 A 1 2 B 2 3 B 2 4 C 3 5 C 3 5 D 3 6 D 3 ;

推荐答案

查找图的所有连接组件的常规方法是广度优先搜索深度优先搜索. SAS不是实现此类算法的最佳工具,因为它们需要使用诸如队列之类的数据结构.

The general methods for finding all connected components of a graph are Breadth-First-Search or Depth-First-Search. SAS is not the best tool for implementing such algorithms since they require using such data structures as queues.

仍然可以使用哈希对象来完成.这是用于BF搜索的代码.

Still it can be done with hash objects. Here's the code for BF-search.

data people;
  input person_id address_id $ household_id;
  datalines;
1 A 1
2 B 2
3 B 2
4 C 3
5 C 3
5 D 3
6 D 3
;
run;

创建邻接表-具有共同地址的所有成对的人.还有空变量cluster,稍后将使用组的ID填充该变量:

Create adjacency list - all pairs of people with a common address. And empty variable cluster which will be populated later with groups' IDs:

proc sql;
  create table connections as
  select distinct a.person_id as person_id_a, b.person_id as person_id_b, . as cluster
  from people a
  inner join people b
  on a.address_id=b.address_id
;
quit;

这里是BF搜索本身:

data _null_;

为所有唯一的人(图形的顶点)声明哈希对象及其迭代器:

Declare hash object and its iterator for all unique people (vertices of the graph):

  if 0 then set Connections;
  dcl hash V(dataset:'Connections', ordered:'y');
  V.defineKey('person_id_a');
  V.defineData('person_id_a','cluster');
  dcl hiter Vi('V');
  V.defineDone();

声明所有连接的哈希对象(图形的边缘):

Declare hash object for all connections (edges of the graph):

  dcl hash E(dataset:'Connections', multidata:'y');
  E.defineKey('person_id_a');
  E.defineData('person_id_a','person_id_b');
  E.defineDone();

为队列声明哈希对象及其迭代器:

Declare hash object and its iterator for the queue:

  dcl hash Q(ordered:'y');
  Q.defineKey('qnum','person_id_a');
  Q.defineData('qnum','person_id_a');
  dcl hiter Qi('Q');
  Q.defineDone();

最外面的循环-当队列为空时,用于将没有分配簇的新人员作为下一个簇的根:

The outermost loop - for taking a new person without assigned cluster to be a root of the next cluster, when the queue is empty:

  rc1=Vi.first();
  do while(rc1=0);
      if missing(cluster) then do;
          qnum=1; Q.add(); *qnum-number of the person in the queue, to ensure that new     people are added to the end of the queue.;
          n+1; cluster=n;
          V.replace();*assign cluster number to a person;

在以下两个嵌套循环中,我们使队列中的第一个人出队,并在邻接表中查找与此人相关的所有人.我们将找到的每个连接"添加到队列的末尾.与第一个人交往后,我们将其删除,并让下一个人(现在成为第一个人)出队.所有这些都将在同一群集中.依此类推,直到队列为空.然后,我们为新集群选择一个新的root用户.

In the following two nested loops we de-queue the first person in the queue and look for all people connected to this person in adjacency list. Every found 'connection' we add to the end of the queue. When done with the first person, we delete him/her and de-queue the next one (who became the first now). All of them will be in the same cluster. And so on, until the queue is empty. Then we take a new root person for a new cluster.

          rc2=Qi.first();
          do while(rc2=0);
              qnum=qnum+Q.num_items-1;
              rc3=E.find();
              do while(rc3=0);
                   person_id_a=person_id_b;
                   rc4=V.find();
                   if rc4=0 and missing(cluster) then do;
                       qnum+1; Q.add();
                       cluster=n;
                       V.replace();
                   end;
                   rc3=E.find_next();
              end;
              Qi.first();
              Qi.delete();
              Q.remove();
              Qi=_new_ hiter ('Q');
              rc2=Qi.first();
          end;
      end;
      rc1=Vi.next();
  end;

已分配聚类的人员的输出列表.

Output list of people with assigned clusters.

  V.output(dataset:'clusters');
run;


proc sort data=clusters; by cluster; run;

这篇关于识别给定边的连接图的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆