How to use multiprocessing to drop duplicates in a very big list?


Question


Let's say I have a huge list containing random numbers, for example:

import random

L = [random.randrange(0, 25000000000) for _ in range(1000000000)]


I need to get rid of the duplicates in this list.


I wrote this code for lists containing a smaller number of elements:

def remove_duplicates(list_to_deduplicate):
    seen = set()
    result = []
    for i in list_to_deduplicate:
        if i not in seen:       # first time we see this value
            result.append(i)    # keep it, preserving order of appearance
            seen.add(i)         # remember it so later copies are skipped
    return result


In the code above I create a set so I can keep track of which numbers have already appeared in the list I'm working on. If a number is not in the set, I add it to the result list that I need to return and also save it in the set, so it won't be added to the result list again.
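For example, on a small list it keeps the first occurrence of each value (a quick illustration using the function above):

values = [3, 1, 3, 2, 1, 5]
print(remove_duplicates(values))  # [3, 1, 2, 5]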


Now for 1,000,000 numbers in a list all is good and I can get a result fast, but for larger sizes, say 1,000,000,000, problems arise. I need to use the different cores on my machine to try to break up the problem and then combine the results from the multiple processes.


My first guess was to make a set accessible to all processes, but many complications arise: how can one process read the set while another one may be adding to it? I don't even know if it is possible to share a set between processes. I know we can use a Queue or a Pipe, but I'm not sure how to use them.
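To make the idea concrete, this is roughly what I was picturing: split the list into chunks, deduplicate each chunk in its own process, and merge the partial results at the end (chunk count and helper names here are just for illustration):

from multiprocessing import Pool

def dedupe_chunk(chunk):
    # drop duplicates inside a single chunk
    return set(chunk)

def parallel_dedupe(data, n_procs=8):
    # split the list into one chunk per process
    chunk_size = (len(data) + n_procs - 1) // n_procs
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(n_procs) as pool:
        partial_sets = pool.map(dedupe_chunk, chunks)
    # merge the per-chunk sets; duplicates that span chunks disappear here
    return list(set().union(*partial_sets))

Unlike remove_duplicates above this loses the order of appearance, and every chunk still has to be pickled to a worker process, so I'm not sure it would actually be faster.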


Can someone give me advice on the best way to solve this problem? I am open to any new ideas.

Answer


I'm skeptical that even your largest list is big enough for multiprocessing to improve timings. Using numpy and multithreading is probably your best chance.


Multiprocessing introduces quite some overhead and increases memory consumption, as @Frank Merrow rightly mentioned earlier. That's not the case (to that extent) for multithreading, though. It's important not to mix these terms up because processes and threads are not the same: threads within the same process share their memory, distinct processes do not.


The problem with going multi-core in Python is the GIL, which doesn't allow multiple threads (in the same process) to execute Python bytecode in parallel. Some C extensions like numpy can release the GIL, which makes it possible to profit from multi-core parallelism with multithreading. Here's your chance to get some speed-up on top of the big improvement you already get just by using numpy.

from multiprocessing.dummy import Pool  # .dummy uses threads
import numpy as np

r = np.random.RandomState(42).randint(0, 25000000000, 100_000_000)
n_threads = 8

result = np.unique(np.concatenate(
    Pool(n_threads).map(np.unique, np.array_split(r, n_threads)))
).tolist()


Use numpy and a thread pool, split up the array, make the sub-arrays unique in separate threads, then concatenate the sub-arrays and make the recombined array unique once more. The final dropping of duplicates for the recombined array is necessary because within the sub-arrays only local duplicates can be identified.


For low-entropy data (many duplicates), using pandas.unique instead of numpy.unique can be much faster. Unlike numpy.unique it also preserves the order of appearance.
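A tiny illustration of that difference (toy data, just to show the ordering behaviour):

import numpy as np
import pandas as pd

a = np.array([3, 1, 3, 2, 1])
print(np.unique(a))  # [1 2 3]  -> sorted
print(pd.unique(a))  # [3 1 2]  -> keeps order of first appearance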


Note that using a thread pool like the above only makes sense if the numpy function is not already multi-threaded under the hood by calling into low-level math libraries. So always test to see if it actually improves performance and don't take it for granted.


Tested with 100M randomly generated integers in the ranges:

  • high entropy: 0 - 25_000_000_000 (199560 duplicates)
  • low entropy: 0 - 1000

import time
import timeit
from multiprocessing.dummy import Pool  # .dummy uses threads

import numpy as np
import pandas as pd


def time_stmt(stmt, title=None):
    t = timeit.repeat(
        stmt=stmt,
        timer=time.perf_counter_ns, repeat=3, number=1, globals=globals()
    )
    print(f"\t{title or stmt}")
    print(f"\t\t{min(t) / 1e9:.2f} s")


if __name__ == '__main__':

    n_threads = 8  # machine with 8 cores (4 physical cores)

    stmt_np_unique_pool = \
"""
np.unique(np.concatenate(
    Pool(n_threads).map(np.unique, np.array_split(r, n_threads)))
).tolist()
"""

    stmt_pd_unique_pool = \
"""
pd.unique(np.concatenate(
    Pool(n_threads).map(pd.unique, np.array_split(r, n_threads)))
).tolist()
"""
    # -------------------------------------------------------------------------

    print(f"\nhigh entropy (few duplicates) {'-' * 30}\n")
    r = np.random.RandomState(42).randint(0, 25000000000, 100_000_000)

    r = list(r)
    time_stmt("list(set(r))")

    r = np.asarray(r)
    # numpy.unique
    time_stmt("np.unique(r).tolist()")
    # pandas.unique
    time_stmt("pd.unique(r).tolist()")    
    # numpy.unique & Pool
    time_stmt(stmt_np_unique_pool, "numpy.unique() & Pool")
    # pandas.unique & Pool
    time_stmt(stmt_pd_unique_pool, "pandas.unique() & Pool")

    # ---
    print(f"\nlow entropy (many duplicates) {'-' * 30}\n")
    r = np.random.RandomState(42).randint(0, 1000, 100_000_000)

    r = list(r)
    time_stmt("list(set(r))")

    r = np.asarray(r)
    # numpy.unique
    time_stmt("np.unique(r).tolist()")
    # pandas.unique
    time_stmt("pd.unique(r).tolist()")
    # numpy.unique & Pool
    time_stmt(stmt_np_unique_pool, "numpy.unique() & Pool")
    # pandas.unique() & Pool
    time_stmt(stmt_pd_unique_pool, "pandas.unique() & Pool")


As you can see in the timings below, just using numpy without multithreading already accounts for the biggest performance improvement. Also note that pandas.unique() is faster than numpy.unique() only when there are many duplicates.

high entropy (few duplicates) ------------------------------

    list(set(r))
        32.76 s
    np.unique(r).tolist()
        12.32 s
    pd.unique(r).tolist()
        23.01 s
    numpy.unique() & Pool
        9.75 s
    pandas.unique() & Pool
        28.91 s

low entropy (many duplicates) ------------------------------

    list(set(r))
        5.66 s
    np.unique(r).tolist()
        4.59 s
    pd.unique(r).tolist()
        0.75 s
    numpy.unique() & Pool
        1.17 s
    pandas.unique() & Pool
        0.19 s

