Share a variable (data from a file) among multiple Python scripts without loading duplicates


Question


I would like to load a big matrix contained in matrix_file.mtx. This load must be made only once: once the variable matrix is loaded into memory, I would like many Python scripts to share it without duplicating it, in order to have a memory-efficient multi-script program in bash (or Python itself). I can imagine some pseudocode like this:

# Loading and sharing script:
import share
matrix = open("matrix_file.mtx","r")
share.send_to_shared_ram(matrix, as_variable('matrix'))

# Shared matrix variable processing script_1
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>

# Shared matrix variable processing script_2
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
...

The idea is for pointer_to_matrix to point to matrix in RAM, which is loaded only once rather than n times by the n scripts. The scripts are called separately from a bash script (or, if possible, from a Python main):

$ python Load_and_share.py
$ python script_1.py -args string &
$ python script_2.py -args string &
$ ...
$ python script_n.py -args string &

I'd also be interested in solutions via the hard disk, i.e. matrix could be stored on disk while the share object accesses it as required. Nonetheless, the object (a kind of pointer) in RAM could be treated as if it were the whole matrix.

Thank you for your help.

Solution

Between the mmap module and numpy.frombuffer, this is fairly easy:

import mmap
import os   # only needed for the optional posix_fadvise call below
import numpy as np

with open("matrix_file.mtx","rb") as matfile:
    mm = mmap.mmap(matfile.fileno(), 0, access=mmap.ACCESS_READ)
    # Optionally, on UNIX-like systems in Py3.3+, add:
    # os.posix_fadvise(matfile.fileno(), 0, len(mm), os.POSIX_FADV_WILLNEED)
    # to trigger background read in of the file to the system cache,
    # minimizing page faults when you use it

matrix = np.frombuffer(mm, np.uint8)

Each process would perform this work separately, and get a read-only view of the same memory. You'd change the dtype to something other than uint8 as needed. Switching to ACCESS_WRITE would allow modifications to shared data, though it would require synchronization and possibly explicit calls to mm.flush to actually ensure the data was reflected in other processes.
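
For instance, a minimal writable variant might look like this (a sketch: the r+b mode and ACCESS_WRITE are the real mmap API, but the uint8 dtype and the slice written are illustrative assumptions):

import mmap
import numpy as np

with open("matrix_file.mtx", "r+b") as matfile:  # file must be opened writable
    mm = mmap.mmap(matfile.fileno(), 0, access=mmap.ACCESS_WRITE)

matrix = np.frombuffer(mm, np.uint8)  # writable, zero-copy view of the mapping
matrix[:16] = 0  # example in-place modification, visible to other mappers
mm.flush()       # push the change to the backing file and other processes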

A more complex solution that follows your initial design more closely might be to use multiprocessing.SyncManager to create a connectable shared "server" for data, allowing a single common store of data to be registered with the manager and returned to as many users as desired. Creating an Array (based on ctypes types) with the correct type on the manager, then register-ing a function that returns the same shared Array to all callers, would work too (each caller would then convert the returned Array via numpy.frombuffer as before). It's much more involved (it would be easier to have a single Python process initialize an Array, then launch Processes that would share it automatically thanks to fork semantics), but it's the closest to the concept you describe.
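
A minimal sketch of that simpler fork-based variant (the float64 dtype, the raw-binary read via np.fromfile, and the worker count are assumptions for illustration; a real .mtx file in Matrix Market text format would need a proper parser such as scipy.io.mmread):

import multiprocessing
import numpy as np

def worker(shared_arr):
    # Zero-copy reinterpretation of the shared ctypes buffer as a NumPy array
    matrix = np.frombuffer(shared_arr.get_obj(), dtype=np.float64)
    print(matrix[:5])  # every worker reads the same physical memory

if __name__ == "__main__":
    # The parent loads the data once into shared memory...
    data = np.fromfile("matrix_file.mtx", dtype=np.float64)  # assumed raw layout
    shared_arr = multiprocessing.Array("d", data.size)
    np.frombuffer(shared_arr.get_obj(), dtype=np.float64)[:] = data

    # ...then the children inherit the shared buffer instead of reloading the file
    procs = [multiprocessing.Process(target=worker, args=(shared_arr,))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()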

