分组字典键whos值(子列表)具有共同条目 [英] Grouping dictionary keys whos values (sublist) have entries in common

查看:101
本文介绍了分组字典键whos值(子列表)具有共同条目的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

参考这个问题从使用SQLite的共享条目中查找两个列的值组



我在TCL中试了一下,但是我在某个地方的圈子中迷路了:

  set MyList [list 50 {23 25} 34 {6 11} 78 {25 9} 45 {2 45} 39 {12 9} 40 {6 2}] 

set AllGroups [list]

dict for {kv} $ MyList {
set Group $ k
foreach {N1 N2} $ v {break}

dict for {k2 v2} $ MyList {
foreach {N1_2 N2_2} $ v2 {break}
如果{$ N1 == $ N1_2} {
追加组$ k2
}
如果{$ N1 == $ N2_2} {
追加组$ k2
}
}
lappend AllGroups $ Group
}

输出是: p>

50 3440 78 4540 39 4040



看起来像一个有前途的解决方案。我认为循环看起来正确,我的错误在哪里?任何帮助是赞赏。也许我应该使用结构体?

解决方案

这个问题看起来很简单,但实际上令人惊讶的是很难得到。因此,相当长的解决方案。这个特别的问题已经被研究了很多,算法可以在网络上找到,但是当然我不得不这样做,并且想出自己的实现。这意味着,虽然它对我已经尝试过的数据正常工作,但可能效率低下,可能仍然包含错误。我认为这是一个相当天真的解决方案,在CS的意义上是公平的。



(这在研究中没有帮助对于这个解决方案,我发现我与现在的计算机科学术语(我没有真正进入CS二十多年),我拿起了最大公共子图一词来描述我正在搜索的内容因为,但现在看来实际上是一些微妙的区别啊,好吧,正如我所说,我放弃了试图使用一个既定的算法,并且滚动了我自己。)



问题有一组每个都有两个节点的 EID (CS-说:顶点);在EID之间共享的节点在它们之间形成直接连接边缘),目标是找到(CS-speak: ,可能不是具有直接和间接连接的EID的最大通用子图,可能传递闭包

。 >

为了使解决方案易于处理,我将流程分为以下步骤:


  1. 找到列表的连接(每个连接是通过共享节点连接的两个或多个EID的列表
    ,或者
    单个,未连接的EID)

  2. 构建连接字典,
    的键是EID,值是与直接连接(通过一个或多个节点)的
    的EID列表 - 注意
    某些值在这一点上的列表可能已经是这个EID的束
    ,而大多数只是这样的串的子集。

  3. finally,build
    a字典,其中键是单调的折叠
    整数(即我编号的项目),值是
    EID形成束的列表。

我描述每一步进一步执行它的命令。

  proc main table {
#此命令将所有处理步骤放在一起。该表
#设置在页面的底部。

puts [set data [makedatadictionary $ table]]

puts [set connections [findconnections $ data]]

puts [set connectionsdict \
[makeconnectionsdict [dict keys $ data] $ connections]]

set bunddict [makebunchdict $ linksdict]

puts\\\
CF EIDs\\\
-- ---------
dict for {cf EIDs} $ bunddict {
puts$ cf $ EIDs
}
}

这是构建束字典的命令。它处理输入字典中的每个键,并通过递归地查看其值列表中的每个EID来收集直接或间接连接到它的EID。这里的一个非常明显的缺陷是,子图中的每个EID都将产生相同的收集的EID列表(尽管可能在不同的排序顺序中),所以在添加之前,我们必须检查子图是否还没有在字典中

  proc makebunchdict linksdict {
#给定一个包含EID密钥和EID
#令牌的连接字典代表直接连接的EID,这个命令
#选出直接或间接连接的EID串。
set result [dict create]
set n 0
dict为{key - } $ linksdict {
set gather [list]

recursivelycollect $ key $ linksdict收集

集合[lsort $已收集]
如果{$收集ni [dict values $ result]} {
dict set result [incr n] $ collected
}
}
设置结果
}

这是递归访问每个EID密钥的命令。它发现每个EID已经在收集的EID的列表中停止。

  proc recursivelycollect {key connectionsdict varName} {
#递归访问直接连接的
#组中的每个EID,将唯一的EID保存在原始调用者堆栈帧中的
#中的变量中。
upvar 1 $ varName收集
lappend收集$ key
foreach n [dict get $ linksdict $ key] {
如果{$ n $ $收集} {
recursivelycollect $ n $ linksdict收集
}
}
}

是设置连接字典的命令。这是非常简单的:对于每个键,它构建一个列表,它是键出现的所有列表的列表联合。然后将每个生成的列表减少到唯一的成员。

  proc makeconnectionsdict {keys connections} {
#给定一组EID令牌的密钥和包含直接连接的EID的列表
#,该命令构造一个
#字典,其中EID令牌作为键,每个
#直接连接集的列表EID显示为值。注意
#,很可能
#[dict values $ connections]!= [dict values $ result]
#,因为连接列表连接了一个$ b $的EID列表b#单个节点,而这里的结果列表具有连接
#一个或多个节点的EID。
设置结果[dict create]
foreach key $ keys {
foreach connection $ connections {
if {$ key in $ connection} {
dict lappend result $ key {*} $ connection
}
}
dict set result $ key [lsort -unique [dict get $ result $ key]]
}
设置结果
}

这是找到哪些EID相互连接的命令。这是非常简单的:它基本上只是输入字典的反转。我删除了最明显的重复项。

  proc findconnections data {
#此命令发现键之间的直接连接在
#字典中传递给它。如果两个密钥之间共享任何成员的值列表,则直接连接存在
#。
#例如
#a {b c}和d {e c}直接相连,但
#a {b c}和f {g h}不是。

#结果是列表列表,其中每个子列表都包含
#*两个或更多个键:这些键通过
#单个值相互连接列表成员,或
#*一个单键:这些键根本没有连接。
设置结果[dict create]
dict为{key value} $ data {
foreach val $ value {
dict lappend result $ val $ key
}
}
#只返回结果字典中的值,并且只有
#在这个特殊的唯一值。
lsort -unique [dict values $ result]
}

这是命令,将EID /节点/节点数据表简单转换为字典。这只是一个方便的命令,让我以更可行的格式定义输入。

  proc makedatadictionary table {
#将N x 3表转换为N个项目的字典,其中
#的键是第1列中的值,其值为第2列和第3列中的
#列表。
设置数据[dict create]
foreach {col1 col2 col3} $ table {
dict set data $ col1 [list $ col2 $ col3]
}
设置数据
}

这是你如何开始的。该参数由表示第一列中的EID令牌和第二列和第三列中的节点编号的数据组成。实际值不影响此代码的工作,但不能列出任何值。



(在此示例中,EID 50 - 40来自OP和大概是现实生活中的数据,其余的由我来测试解决方案。)

  main {
50 23 25
34 6 11
78 25 9
45 2 45
39 12 9
40 6 2
99 1 3
98 4 5
97 4 7
}

(注意:评论中提到的Hoodiecrow是我,我以前使用了这个昵称。)


Referring to this problem Finding groups of values from two colums which have entries in common using SQLite

I gave it a try in TCL, but I got lost in the loops somewhere:

set MyList [ list 50 { 23 25 } 34 { 6 11 } 78 { 25 9 } 45 { 2 45 } 39 { 12 9 } 40 { 6 2 }]

set AllGroups [list]

 dict for {k v} $MyList {
   set Group $k       
   foreach {N1 N2} $v {break}

            dict for {k2 v2} $MyList {
                  foreach {N1_2 N2_2} $v2 {break}
                  if { $N1 == $N1_2 } {                   
                      append Group $k2
                    }
                 if { $N1 == $N2_2 } {                   
                     append Group $k2
                   }    
          }
    lappend AllGroups $Group
}

The output is:

50 3440 78 4540 39 4040

which looks like a promising start to a solution. I think the loop looks correct, where is my mistake? Any help is appreciated. Maybe I should use structs instead?

解决方案

This problem looks simple but is actually surprisingly hard to get right. Hence, the rather long solution. This particular problem has been studied a lot and algorithms can be found on the web, but of course I had to do it the hard way and come up with my own implementation. This means that while it works correctly for the data I've tried it with, it may be inefficient and may still contain bugs. I think it'd be fair to say that its a rather "naive" solution, in the CS sense of the word.

(It doesn't help that while researching for this solution, I found that I'm way out of touch with current computer science terminology (I haven't really been into CS for over twenty years). I picked up the term "maximal common subgraph" to describe what I was searching for, but now it seems that's actually something subtly different. Ah well, as I said I gave up trying to use an established algorithm and rolled my own anyway.)

The problem has a set of EIDs (CS-speak: vertices) that each have two nodes; nodes shared between EIDs form a direct connection (edge) between them, and the object is to find bunches (CS-speak: not cliques, probably not maximal common subgraphs, possibly transitive closures) of EIDs that have direct and indirect connections.

To make the solution tractable, I split the process into steps:

  1. find the list of connections (where each connection is either a list of two or more EIDs connected through a shared node, or else a single, unconnected, EID)
  2. build a dictionary of connections where the keys are EIDs and the values are lists of EIDs that they appear in direct connections with (through one or more nodes) -- note that some of the value lists at this point may already be the bunch of that EID, while most are just subsets of such bunches.
  3. finally, build a dictionary where the keys are monotonically increasing integers (i.e. I numbered the items) and the values are lists of EIDs forming "bunches".

I describe each step a bit further next to the command that performs it.

proc main table {
    # This command puts all the processing steps together. The table 
    # is set up at the bottom of the page.

    puts [set data [makedatadictionary $table]]

    puts [set connections [findconnections $data]]

    puts [set connectionsdict \
        [makeconnectionsdict [dict keys $data] $connections]]

    set bunchdict [makebunchdict $connectionsdict]

    puts "\nCF EIDs\n-----------"
    dict for {cf EIDs} $bunchdict {
        puts "$cf  $EIDs"
    }
}

This is the command that constructs the bunch dictionary. It processes each key in the input dictionary and collects the EIDs directly or indirectly connected to it by recursively looking at each one of the EIDs in its value list. A (very very obvious) pitfall here is that every EID in a subgraph will produce the same list of collected EIDs (though likely in different sorting orders), so we have to check if the subgraph isn't already in the dictionary before we add it.

proc makebunchdict connectionsdict {
    # Given a connections dictionary containing EID keys and EID 
    # tokens representing directly connected EIDs, this command 
    # picks out bunches of EIDs, directly or indirectly connected.
    set result [dict create]
    set n 0
    dict for {key -} $connectionsdict {
        set collected [list]

        recursivelycollect $key $connectionsdict collected

        set collected [lsort $collected]
        if {$collected ni [dict values $result]} {
            dict set result [incr n] $collected
        }
    }
    set result
}

This is the command that recursively visits each EID key. It stops when every EID it finds is already in the list of collected EIDs.

proc recursivelycollect {key connectionsdict varName} {
    # Recursively visits every EID in a directly connected 
    # group, saving unique EIDs in a variable that lives in 
    # the original caller's stack frame.
    upvar 1 $varName collected
    lappend collected $key
    foreach n [dict get $connectionsdict $key] {
        if {$n ni $collected} {
            recursivelycollect $n $connectionsdict collected
        }
    }
}

This is the command that sets up a dictionary of connections. It's fairly straightforward: for every key it builds a list that is the list union of all lists where the key appears. It then reduces each resulting list to unique members.

proc makeconnectionsdict {keys connections} {
    # Given a set of keys which are EID tokens, and a list of lists 
    # containing directly connected EIDs, this command constructs a 
    # dictionary with the EID tokens as keys and the lists of every 
    # direct connection set that the EID appears in as values. Note 
    # that it's very likely that
    #   [dict values $connections] != [dict values $result]
    # since the list of connections has lists of EIDs connected by a
    # single node, while the result list here has EIDs connected by 
    # one or more nodes.
    set result [dict create]
    foreach key $keys {
        foreach connection $connections {
            if {$key in $connection} {
                dict lappend result $key {*}$connection
            }
        }
        dict set result $key [lsort -unique [dict get $result $key]]
    }
    set result
}

This is the command that finds out which EIDs are connected to each other. It's very straightforward: it's basically just an inversion of the input dictionary. I remove the most obvious duplicates at the end.

proc findconnections data {
    # This command discovers direct connections between keys in the 
    # dictionary which is passed to it. A direct connection exists 
    # between two keys if they share any members of their value lists. 
    # E.g. 
    #   a {b c}  and  d {e c}  are directly connected, but
    #   a {b c}  and  f {g h}  are not.
    #
    # The result is a list of lists, where each sublist either contains 
    #  * two or more keys: these keys are connected to each other by a 
    #    single value list member, or
    #  * a single key: these keys have no connections at all.
    set result [dict create]
    dict for {key value} $data {
        foreach val $value {
            dict lappend result $val $key
        }
    }
    # Return only the values from the result dictionary, and only 
    # trivially unique values at that.
    lsort -unique [dict values $result]
}

This is the command that trivially converts a EID/node/node data table to a dictionary. It's just a convenience command to let me define the input in a more workable format.

proc makedatadictionary table {
    # Convert a N x 3 table to a dictionary of N items where 
    # the key is the value in column 1 and its value is the 
    # list of the values in column 2 and 3.
    set data [dict create]
    foreach {col1 col2 col3} $table {
        dict set data $col1 [list $col2 $col3]
    }
    set data
}

This is how you get it started. The argument consists of data that represents EID tokens in the first column and node numbers in the second and third columns. The actual values don't affect the workings of this code, but none of the values should be lists.

(In this example, the EIDs 50 -- 40 come from the OP and are presumably real-life data, the rest were made up by me to test the solution.)

main {
    50  23  25
    34   6  11
    78  25   9
    45   2  45
    39  12   9
    40   6   2
    99   1   3
    98   4   5
    97   4   7
}

(Note: the 'Hoodiecrow' mentioned in the comments is me, I used that nick earlier.)

这篇关于分组字典键whos值(子列表)具有共同条目的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆