Lock free read only List in Python?


Question

I've done some basic performance and memory consumption benchmarks and I was wondering if there is any way to make things even faster...

  1. I have a giant 70,000-element list of tuples, each holding a numpy ndarray and the corresponding file path.

My first version passed a sliced-up copy of the list to each of the processes in the Python multiprocessing module, but it would explode RAM usage to over 20 gigabytes.

In the second version, I moved the list into the global scope and access it by index (e.g. foo[i]) in a loop in each of my processes. This seems to put it into a shared memory area with copy-on-write semantics, so memory usage does not explode (it stays at ~3 gigabytes).
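The copy-on-write pattern described here can be sketched as below. The array contents and paths are made-up stand-ins, and the no-copy behavior relies on a fork-based start method (e.g. the default on Linux); note that CPython's reference-count updates can still dirty some pages, but the large array data itself stays shared as long as it is only read.

```python
import multiprocessing
import numpy as np

# Hypothetical stand-in for the real (ndarray, path) list, built before
# the workers are forked so they inherit it copy-on-write.
SIM = [(np.arange(6, dtype=np.int32).reshape(2, 3), "/fake/path/%d.png" % i)
       for i in range(4)]

def checksum(index):
    # Pure reads: the large array pages stay shared between processes.
    arr, path = SIM[index]
    return int(arr.sum()), path

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        print(pool.map(checksum, range(len(SIM))))
```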

However, according to the performance benchmarks/tracing, it seems like the large majority of the application time is now spent in "acquire" mode...

So I was wondering if there is any way I can somehow turn this list into something lock-free/read-only, so that I can do away with part of the acquire step and speed up access even more.

Edit 1: Here are the first few lines of the profiling output for the app:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   65 2450.903   37.706 2450.903   37.706 {built-in method acquire}
39320    0.481    0.000    0.481    0.000 {method 'read' of 'file' objects}
  600    0.298    0.000    0.298    0.000 {posix.waitpid}
   48    0.271    0.006    0.271    0.006 {posix.fork}

Edit 2: Here's an example of the list structure:

# Sample code for a rough idea of how the list is constructed
import os
import numpy as np
from PIL import Image

sim = []
for root, dirs, files in os.walk(rootdir):
    for filename in files:
        path = os.path.join(root, filename)
        image = Image.open(path)
        np_array = np.asarray(image)
        sim.append((np_array, path))

# Roughly, an entry would look something like this
sim = [(np.array([[1, 2, 3], [4, 5, 6]], np.int32), "/foobar/com/what.something")]

From then on, the sim list is to be read-only.

Answer

The multiprocessing module provides exactly what you need: a shared array with optional locking, namely the multiprocessing.Array class. Pass lock=False to the constructor to disable locking.
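A minimal sketch of this (array contents and pool size are made up): with lock=False, multiprocessing.Array returns a raw shared array with no synchronization wrapper, so reads never go through an acquire step. This is safe here only because the workers strictly read.

```python
import multiprocessing

# lock=False: no Lock wrapper, so access involves no acquire/release.
shared = multiprocessing.Array("i", [1, 2, 3, 4], lock=False)

def total(_):
    # Safe only because every worker strictly reads the shared buffer.
    return sum(shared)

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        print(pool.map(total, range(2)))  # [10, 10]
```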

Edit (taking into account your update): Things are actually considerably more involved than I initially expected. The data of all elements in your list needs to be created in shared memory. Whether you put the list itself (i.e. the pointers to the actual data) in shared memory does not matter too much, because it should be small compared to the data of all files. To store the file data in shared memory, use

shared_data = multiprocessing.sharedctypes.RawArray("c", data)

where data is the data you read from the file. To use this as a NumPy array in one of the processes, use

numpy.frombuffer(shared_data, dtype="c")

which will create a NumPy array view for the shared data. Similarly, to put the path name into shared memory, use

shared_path = multiprocessing.sharedctypes.RawArray("c", path)

where path is an ordinary Python string. In your processes, you can access this as a Python string by using shared_path.raw. Now append (shared_data, shared_path) to your list. The list will get copied to the other processes, but the actual data won't.
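The two pieces can be sketched together for a single entry. The array values and path here are hypothetical, and the view is rebuilt with the array's original dtype and shape rather than dtype="c":

```python
import multiprocessing.sharedctypes
import numpy as np

# Hypothetical entry; in the real program these come from an image file.
arr = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
path = "/foobar/com/what.something"

# Copy the bytes into shared memory ("c" = one char cell per byte).
shared_data = multiprocessing.sharedctypes.RawArray("c", arr.tobytes())
shared_path = multiprocessing.sharedctypes.RawArray("c", path.encode())

# Zero-copy view: reinterpret the shared bytes with the original dtype/shape.
view = np.frombuffer(shared_data, dtype=arr.dtype).reshape(arr.shape)
print(view[1, 2])                # 6
print(shared_path.raw.decode())  # /foobar/com/what.something

sim = [(shared_data, shared_path)]  # only these small handles get copied
```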

I initially meant to use a multiprocessing.Array for the actual list. This would be perfectly possible and would ensure that the list itself (i.e. the pointers to the data) is also in shared memory. Now I think this is not that important at all, as long as the actual data is shared.
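Putting it all together, a minimal end-to-end sketch. The arrays, paths, and pool size are made up; the list of small handles is reached by the workers through inheritance (fork), so only indices are passed around and each worker rebuilds a zero-copy view:

```python
import multiprocessing
import multiprocessing.sharedctypes
import numpy as np

def make_entry(arr, path):
    # One-time copy of the array bytes and the path into shared memory.
    data = multiprocessing.sharedctypes.RawArray("c", arr.tobytes())
    name = multiprocessing.sharedctypes.RawArray("c", path.encode())
    return data, name, arr.dtype, arr.shape

# Hypothetical stand-ins for the 70,000 real (ndarray, path) entries.
sim = [make_entry(np.full((2, 3), i, dtype=np.int32), "/fake/%d.png" % i)
       for i in range(4)]

def process(index):
    # Workers reach sim through inheritance and rebuild a no-copy view.
    data, name, dtype, shape = sim[index]
    arr = np.frombuffer(data, dtype=dtype).reshape(shape)
    return int(arr.sum()), name.raw.decode()

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        print(pool.map(process, range(len(sim))))
```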
