Best way in Python to determine all possible intersections in a matrix?

Problem Description

So if I have a matrix (list of lists) where each column represents a unique word, each row represents a distinct document, and every entry is a 1 or 0, indicating whether or not the word for a given column exists in the document for a given row.

What I'd like to know is how to determine all the possible combinations of words and documents where more than one word is in common with more than one document. The result might look something like:

[ [[Docid_3, Docid_5], ['word1', 'word17', 'word23']],
  [[Docid_3, Docid_9, Docid_334], ['word2', 'word7', 'word23', 'word68', 'word982']],
  ...
]

and so on for each possible combination. Would love a solution that provides the complete set of combinations and one that yields only the combinations that are not a subset of another, so from the example, not [[Docid_3, Docid_5], ['word1', 'word17']] since it's a complete subset of the first example.
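
One simple way to get the "not a subset of another" variant is to post-process the complete result with a maximality filter. Here is a minimal sketch, assuming each combination is a (docids_tuple, words_set) pair; the helper name keep_maximal is hypothetical, not anything from the original post:

def keep_maximal(combos):
    # combos: list of (tuple_of_docids, set_of_words) pairs.
    # Keep a pair only if no other pair contains it on both sides
    # (with strict containment on at least one side, so exact
    # duplicates don't knock each other out).
    maximal = []
    for i, (ds, ws) in enumerate(combos):
        dset = set(ds)
        dominated = any(
            i != j and dset <= set(ds2) and ws <= ws2
            and (dset < set(ds2) or ws < ws2)
            for j, (ds2, ws2) in enumerate(combos))
        if not dominated:
            maximal.append((ds, ws))
    return maximal

combos = [(('Docid_3', 'Docid_5'), set(['word1', 'word17', 'word23'])),
          (('Docid_3', 'Docid_5'), set(['word1', 'word17']))]
print keep_maximal(combos)   # drops the strictly-contained second pair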

I feel like there is an elegant solution that just isn't coming to mind and the beer isn't helping.

Thanks.

Answer

First, build a mapping from document ID to set of words -- your matrix of 0s and 1s is quite an unwieldy structure to process directly. If I read you correctly, the "column headings" (words) are the first list in the matrix (minus, presumably, the first item) and the "row headings" (docids) are the first item of each row (minus, presumably, the first row). Then (assuming Python 2.6 or better):

def makemap(matrix):
  im = iter(matrix)
  words = next(im)[1:]   # first row holds the column headings (words); skip the corner cell
  themap = {}
  for row in im:
    mapent = set()
    docid = row[0]       # first column of each remaining row is the docid
    for w, bit in zip(words, row[1:]):
      try:
        if bit: mapent.add(w)
      except:
        # debugging aid: report which word's entry was problematic
        print 'w is %r' % (w,)
        raise
    themap[docid] = mapent
  return themap

Now you need to check all feasible subsets of documents -- the total number of subsets is huge so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping on itertools.combinations for various lengths) will not perform any pruning of course.
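
For a sense of what that unpruned brute force costs, here is a minimal sketch (my illustration, not code from the answer; the name bruteforce is hypothetical), reusing the mapping produced by makemap:

import itertools

def bruteforce(mapping):
  # mapping: docid -> set of words, as produced by makemap.
  # Enumerates ALL docid subsets of size >= 2 -- no pruning at all,
  # so it examines 2**N - N - 1 candidate subsets.
  results = {}
  docids = sorted(mapping)
  for r in range(2, len(docids) + 1):
    for subset in itertools.combinations(docids, r):
      commonw = set.intersection(*[mapping[d] for d in subset])
      if len(commonw) >= 2:
        results[subset] = commonw
  return results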

I would start with all 2-combinations (all pairs of docids -- itertools.combinations is fine for this, of course) and make the first batch (those pairs which have 2+ words in common) of "feasible 2-length subsets". That can go in another mapping with tuples or frozensets of docids as the keys.

Then, to make the feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking the total intersection is still 2+ long of course). This at least does some pruning rather than blindly trying all the 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two;-).

It might perhaps be possible to do even better by recording all docids that proved unable to extend a certain N-length subset -- no point in trying those against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.
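
A sketch of what that second pruning level might look like -- my own illustration, not code from the answer; extend_with_memo and the (words, failed) value layout are hypothetical. The key fact that makes it sound: a docid sharing fewer than 2 words with a subset's intersection also shares fewer than 2 with any superset's (smaller) intersection, so failures can be inherited by descendants:

def extend_with_memo(mapping, feasible):
  # feasible: dict of sorted-docid-tuple -> (common_words, failed), where
  # `failed` is the set of docids already known not to extend this subset.
  # Returns the same structure for the (N+1)-length feasible subsets.
  nxt = {}
  for ds, (ws, failed) in feasible.items():
    md = max(ds)
    inherited = set(failed)      # children inherit the parent's failures
    for d, w in mapping.items():
      if d <= md or d in failed:
        continue
      commonw = ws & w
      if len(commonw) >= 2:
        nxt[ds + (d,)] = (commonw, inherited)
      else:
        inherited.add(d)         # d can't extend any superset of ds either
  return nxt

Seeding the 2-length level with failed sets of dont_bother[d1] | dont_bother[d2] (from the code below) would recover the answer's pairwise pruning as a special case.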

There may be further tweaks for additional pruning, but offhand none immediately comes to mind, so that's where I'd start from. (For readability, I'm not bothering below with micro-optimizations such as iteritems in lieu of items, frozensets in lieu of tuples, etc -- they're probably marginal given those sequences are all O(N) vs the exponential size of the computed structures, though of course worth trying in the tuning/optimizing phase.)

import itertools

def allfeasiblesubsets(matrix):
  mapping = makemap(matrix)
  docids = sorted(mapping.keys())
  feasible_len2 = {}
  # for each docid, the set of docids it's "incompatible" with (fewer than
  # 2 words in common); a docid is trivially incompatible with itself
  dont_bother = dict((d, set([d])) for d in docids)
  for d1, d2 in itertools.combinations(docids, 2):
    commonw = mapping[d1].intersection(mapping[d2])
    if len(commonw) >= 2:
      feasible_len2[d1, d2] = commonw
    else:
      dont_bother[d1].add(d2)
      dont_bother[d2].add(d1)
  all_feasible = [feasible_len2]
  while all_feasible[-1]:
    feasible_Np1 = {}
    for ds, ws in all_feasible[-1].items():
      md = max(ds)   # only extend with docids > max(ds): keeps keys sorted, avoids duplicates
      for d, w in mapping.items():
        if d <= md or any(d in dont_bother[d1] for d1 in ds):
          continue
        commonw = w.intersection(ws)
        if len(commonw) >= 2:
          feasible_Np1[ds+(d,)] = commonw
    all_feasible.append(feasible_Np1)
  return all_feasible[:-1]   # the last level is always empty; drop it

You'll notice I've applied only a mild form of my suggested "further pruning" -- dont_bother only records "incompatibilities" (<2 words in common) between one docid and others -- this may help if there are several pairs of such "incompatible docids", and is simple and reasonably unobtrusive, but is not as powerful in pruning as the harder "full" alternative. I'm also keeping all keys in the feasible* dicts as sorted tuples of docids (as itertools.combinations originally provides them for the pairs) to avoid duplications and therefore redundant work.

Here's the small example I've tried (in the same file as these functions, after, of course, the import for itertools and collections):

mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
        ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
        ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]

mm = makemap(mat)
print mm
afs = allfeasiblesubsets(mat)
print afs

The results, which appear OK, are:

{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])}, {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]

But of course there might still be bugs lurking since I haven't tested it thoroughly. BTW, I hope it's clear that the result as supplied here (a list of dicts for various increasing lengths, each dict having the ordered tuple forms of the docids-sets as keys and the sets of their common words as values) can easily be post-processed into any other form you might prefer, such as nested lists.
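
For instance, a minimal sketch of that post-processing (my addition; the name tonestedlists is hypothetical), flattening the list of dicts into the nested-list shape from the question:

def tonestedlists(all_feasible):
  # flatten the list-of-dicts from allfeasiblesubsets into
  # [[docids...], [words...]] pairs, as in the question
  out = []
  for level in all_feasible:
    for ds, ws in level.items():
      out.append([list(ds), sorted(ws)])
  return out

print tonestedlists(afs)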

(Not that it matters, but the text I'm using in the example is an old Italian proverb;-).
