Share a variable (data from a file) among multiple Python scripts without duplicate loads
Problem description
I would like to load a big matrix contained in matrix_file.mtx. This load must be done only once. Once the variable matrix is loaded into memory, I would like many Python scripts to share it without duplicates, in order to have a memory-efficient multi-script program in bash (or Python itself). I can imagine some pseudocode like this:
# Loading and sharing script:
import share
matrix = open("matrix_file.mtx","r")
share.send_to_shared_ram(matrix, as_variable('matrix'))
# Shared matrix variable processing script_1
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
# Shared matrix variable processing script_2
import share
pointer_to_matrix = share.share_variable_from_ram('matrix')
type(pointer_to_matrix)
# output: <type 'numpy.ndarray'>
...
The idea is for pointer_to_matrix to point to matrix in RAM, which is loaded only once and shared by the n scripts (not loaded n times). The scripts are called separately from a bash script (or, if possible, from a Python main):
$ python Load_and_share.py
$ python script_1.py -args string &
$ python script_2.py -args string &
$ ...
$ python script_n.py -args string &
I'd also be interested in solutions via the hard disk, i.e. matrix could be stored on disk while the share object accesses it as required. Nonetheless, the object in RAM (a kind of pointer) could still be seen as the whole matrix.
Thank you for your help.
Between the mmap
module and numpy.frombuffer
, this is fairly easy:
import mmap
import numpy as np

with open("matrix_file.mtx", "rb") as matfile:
    mm = mmap.mmap(matfile.fileno(), 0, access=mmap.ACCESS_READ)
    # Optionally, on UNIX-like systems in Py3.3+, add (after import os):
    # os.posix_fadvise(matfile.fileno(), 0, len(mm), os.POSIX_FADV_WILLNEED)
    # to trigger background read-in of the file to the system cache,
    # minimizing page faults when you use it
matrix = np.frombuffer(mm, np.uint8)
Each process would perform this work separately and get a read-only view of the same memory. You'd change the dtype to something other than uint8 as needed. Switching to ACCESS_WRITE would allow modifications to the shared data, though that would require synchronization and possibly explicit calls to mm.flush to ensure the changes are actually reflected in other processes.
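As a rough illustration of that writable variant (the file path and the 8-byte payload below are placeholders for demonstration, not part of the original setup), one mapping opened with ACCESS_WRITE modifies the file and flushes, and a second, independent mapping sees the change:

```python
import mmap
import os
import struct
import tempfile

# Placeholder backing file: any small file works; here, 8 zero bytes.
path = os.path.join(tempfile.mkdtemp(), "shared.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 8)

# Writer: map with ACCESS_WRITE, modify, then flush so other
# processes mapping the same file see the change.
with open(path, "r+b") as f:
    wmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
    wmap[:8] = struct.pack("<d", 3.14)  # write one little-endian double
    wmap.flush()                        # push the change to the backing file
    wmap.close()

# Reader: a second mapping (standing in for another process) sees the value.
with open(path, "rb") as f:
    rmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    value = struct.unpack("<d", rmap[:8])[0]
    rmap.close()

print(value)  # 3.14
```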
A more complex solution that follows your initial design more closely might be to use multiprocessing.SyncManager to create a connectable shared "server" for the data, allowing a single common store of data to be registered with the manager and returned to as many users as desired. Creating an Array (based on ctypes types) with the correct type on the manager, then register-ing a function that returns the same shared Array to all callers, would work too (each caller would then convert the returned Array via numpy.frombuffer as before). It's much more involved (it would be easier to have a single Python process initialize an Array, then launch Processes that share it automatically thanks to fork semantics), but it's the closest to the concept you describe.