Fastest algorithm for finding overlap between two very large lists?

Question


I'm trying to build an algorithm in Python to filter a large block of RDF data.

I have one list consisting of about 70 thousand items formatted like <"datum">.

I then have about 6GB worth of items (triples) formatted like <"A"> <"B"> <"C">

I want to extract all the triples that contain any item in the first list, and then extract any triples that contain any individual item from the first extraction (the net effect is to form a partition of the graph that's connected by one step to the seeds from the first list).

I haven't been able to come up with a great algorithm for this (not helped by the fact that I have no formal CS training.)

The best I've come up with so far is to start by splitting the triples in the big list into a list of three-item lists [<"A">, <"B">, <"C">]. I then split that into chunks, and use multiprocessing to create processes that take the full small list and a chunk of the big list and...

for line in big_list:
    for item in small_list:
        if item in line:
            bucket.append(line)

This algorithm takes quite a while.
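
Concretely, the chunking and multiprocessing step looks roughly like this (a sketch with placeholder data; the chunk size and the name match_chunk are arbitrary, not part of the real code):

from multiprocessing import Pool

# Placeholder data: in practice small_list holds ~70,000 strings and
# big_list holds ~6GB worth of triples.
small_list = ["Mickey", "Mouse", "Minny", "Cat"]
big_list = [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"]]

CHUNK_SIZE = 100000  # arbitrary; tune to available memory

def match_chunk(chunk):
    # Same nested-loop match as above, applied to one chunk.
    bucket = []
    for line in chunk:
        for item in small_list:
            if item in line:
                bucket.append(line)
                break  # one match is enough for this line
    return bucket

if __name__ == "__main__":
    chunks = [big_list[i:i + CHUNK_SIZE] for i in range(0, len(big_list), CHUNK_SIZE)]
    with Pool() as pool:
        parts = pool.map(match_chunk, chunks)
    bucket = [line for part in parts for line in part]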

Is there any faster way to do this? If there's a specific algorithm, you can just give me the name and I'll figure out how to implement it.

Thanks!

Clarifications per comments:

  1. All the data items are strings. So small list might contain ["Mickey", "Mouse", "Minny", "Cat"] and big list might be [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"], ...]

  2. Only one item in each big-list triple needs to match an item from the small list for that triple to count

  3. All of the items in the small list are actually unique, so I hadn't thought converting them to a set would make a difference. But I will try that.

  4. I can create whatever intermediate structures I want. I'm experimenting with an inverted index constructed using a shelve right now.
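
The inverted-index experiment mentioned in point 4 looks roughly like this (a sketch; the file name, placeholder data, and key scheme are illustrative only):

import shelve

# Placeholder triples; the real data is ~6GB of them.
triples = [["Mickey", "Pluto", "Bluto"], ["John", "Jane", "Jim"]]

with shelve.open("triple_index") as index:
    # Build: map each item to the indices of the triples containing it.
    for i, triple in enumerate(triples):
        for item in triple:
            postings = index.get(item, [])
            postings.append(i)
            index[item] = postings  # reassign so shelve persists the change

    # Query: touch only the triples that contain a seed item,
    # instead of scanning everything.
    seeds = ["Mickey", "Cat"]
    hit_ids = {i for seed in seeds for i in index.get(seed, [])}
    bucket = [triples[i] for i in hit_ids]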

Solution

You should probably first store the small list in a set, so lookup is faster. This prevents going through 70,000 iterations for every item in big_list.

# Build the set once so each membership test is O(1) instead of a
# scan of all ~70,000 items.
small_list_set = set(small_list)
for line in big_list:
    for item in line:
        if item in small_list_set:
            bucket.append(line)
            break  # don't append the same line twice if several items match
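
The question also asks for a second step: pulling in every triple one hop away from the first extraction. One way is to reapply the same set trick, using the items from the first pass as the new seeds (a sketch, assuming bucket holds the triples matched above):

# Every item that appeared in a first-pass triple becomes a new seed.
second_seeds = {item for line in bucket for item in line}

second_bucket = []
for line in big_list:
    for item in line:
        if item in second_seeds:
            second_bucket.append(line)
            break  # one match is enough for this line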

