Write data to disk in Python as a background process


Problem Description

I have a program in Python that basically does the following:

for j in range(200):
    # 1) Compute a bunch of data
    # 2) Write data to disk

1) takes about 2-5 minutes
2) takes about 1 minute

Note that there is too much data to keep in memory.

Ideally what I would like to do is write the data to disk in a way that avoids idling the CPU. Is this possible in Python? Thanks!

Recommended Answer

You could try using multiple processes, like this:

import multiprocessing as mp

def compute(j):
    data = ...  # 1) compute a bunch of data
    return data

def write(data):
    ...  # 2) write data to disk

if __name__ == '__main__':
    pool = mp.Pool()
    for j in range(200):
        pool.apply_async(compute, args=(j,), callback=write)
    pool.close()
    pool.join()

pool = mp.Pool() will create a pool of worker processes. By default, the number of workers equals the number of CPU cores your machine has.

Each pool.apply_async call queues a task to be run by a worker in the pool of worker processes. When a worker is available, it runs compute(j). When the worker returns a value, data, a thread in the main process runs the callback function write(data), with data being the data returned by the worker.

Some caveats:

  • The data has to be picklable, since it is being communicated from the worker process back to the main process via a Queue.
  • There is no guarantee that the order in which the workers complete tasks is the same as the order in which the tasks were sent to the pool. So the order in which the data is written to disk may not correspond to j ranging from 0 to 199. One way around this problem would be to write the data to an SQLite (or other) database with j as one of the fields of data. Then, when you wish to read the data in order, you could SELECT * FROM table ORDER BY j.
  • Using multiple processes will increase the amount of memory required as data is generated by the worker processes and data waiting to be written to disk accumulates in the Queue. You might be able to reduce the amount of memory required by using NumPy arrays. If that is not possible, then you might have to reduce the number of processes:

pool = mp.Pool(processes=1) 

That will create one worker process (to run compute), leaving the main process to run write. Since compute takes longer than write, the Queue won't get backed up with more than one chunk of data to be written to disk. However, you would still need enough memory to compute on one chunk of data while writing a different chunk of data to disk.

If you do not have enough memory to do both simultaneously, then you have no choice -- your original code, which runs compute and write sequentially, is the only way.
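To illustrate the SQLite approach from the caveats above, here is a minimal sketch (the results table, its columns, and the in-memory database are assumptions): chunks inserted out of order come back in order once you sort by j.

```python
import sqlite3

# Hypothetical schema: store each chunk alongside its index j.
conn = sqlite3.connect(":memory:")  # use a file path for real on-disk storage
conn.execute("CREATE TABLE results (j INTEGER PRIMARY KEY, data BLOB)")

# Workers may finish in any order; simulate out-of-order arrival of chunks.
for j, data in [(2, b"chunk-2"), (0, b"chunk-0"), (1, b"chunk-1")]:
    conn.execute("INSERT INTO results (j, data) VALUES (?, ?)", (j, data))
conn.commit()

# ORDER BY j recovers the original order regardless of write order.
ordered = [row[0] for row in conn.execute("SELECT data FROM results ORDER BY j")]
print(ordered)  # → [b'chunk-0', b'chunk-1', b'chunk-2']
```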
