python multiprocessing: write to same excel file


Question


I am new to Python and I am trying to save the results of five different processes to one Excel file (each process writes to a different sheet). I have read different posts here, but still can't get it done, as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required to fulfill this task. This is my code so far:

list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()

if __name__ == '__main__':
    global list_of_days
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()

def init(l):
    global lock
    lock = l

def f(k):
    global results

    *** DO SOME STUFF HERE***

    results = results[ *** finished pandas dataframe *** ]

    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()


The result is that only one sheet gets created in Excel (I assume it is from the process finishing last). Some questions about this code:

  • How can I avoid defining the global variables?
  • Is it even possible to pass DataFrames between processes?
  • Should I move the lock into main?


I would really appreciate some input here, as I consider mastering multiprocessing instrumental. Thanks

Recommended Answer


1) Why did you implement time.sleep in several places in your 2nd method?


In __main__, time.sleep(0.1) gives the started process a timeslice to start up.
In f2(fq, q), it gives the queue a timeslice to flush all buffered data to the pipe, since q.get_nowait() is used.
In w(q), it was only there to simulate a long-running writer.to_excel(...) during testing; I removed this one.


2) What is the difference between pool.map and pool = [mp.Process( . )]?


Using pool.map needs no Queue and no parameter passing, and the code is shorter. Each worker process has to return its result immediately and then terminates. pool.map keeps starting new tasks until all iterations are done. The results have to be processed after that.


Using pool = [mp.Process( . )] starts n processes. Each process terminates on queue.Empty.


Can you think of a situation where you would prefer one method over the other?


Method 1: quick setup, serialized, only interested in the results to continue with.
Method 2: if you want to run the whole workload in parallel.


You can't use a global writer in the processes.
The writer instance has to belong to one process.
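This is because each child process gets its own copy of the parent's globals, so changes a worker makes never reach the parent. A tiny sketch demonstrating this (the names here are illustrative, not from the question):

```python
import multiprocessing as mp

counter = 0  # module-level global, copied into each child

def bump():
    global counter
    counter += 1  # modifies only the child's private copy

if __name__ == '__main__':
    p = mp.Process(target=bump)
    p.start()
    p.join()
    print(counter)  # still 0 in the parent
```

The same applies to the writer in the question: each process holds an independent copy, and each save overwrites the file.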

Usage of mp.Pool, for instance:

import multiprocessing as mp
import pandas as pd

def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))

    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])

    writer.save()
    pool.close()


This leads to .to_excel(...) being called in sequence in the __main__ process.


If you want .to_excel(...) to run in parallel, you have to use mp.Queue().
For instance:

The worker process:

# The queue.Empty exception has to be imported from the right module
try:
    # Python 3
    import queue
except ImportError:
    # Python 2
    import Queue as queue

def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            exit(0)

        # *** DO SOME STUFF HERE***

        results = pd.DataFrame(df_)
        q.put( (list_of_days[k], results) )
        time.sleep(0.1)

The writer process:

def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            # unpacking the 'STOP' sentinel raises ValueError
            titel, result = q.get()
        except ValueError:
            writer.save()
            exit(0)

        result.to_excel(writer, sheet_name=titel)
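Note that q.get() itself never raises ValueError here; it is the tuple unpacking of the 'STOP' string (sent by __main__ after the workers finish) that fails and ends the loop. A quick check of that mechanism:

```python
# 'STOP' has four characters, so unpacking it into two names
# raises ValueError — the writer process uses this as its exit signal
stopped = False
try:
    titel, result = 'STOP'
except ValueError:
    stopped = True
print(stopped)  # True
```

A two-character sentinel such as 'OK' would unpack silently into two one-character strings, so the length of the sentinel matters with this trick.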

The __main__ process:

import multiprocessing as mp
import os
import time

if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)

    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)

    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
        time.sleep(0.1)

    for p in pool:
        p.join()

    w_q.put('STOP')
    w_p.join()


Tested with Python:3.4.2 - pandas:0.19.2 - xlsxwriter:0.9.6
