Sharing large pandas DataFrame with multiprocessing for loop in Python

Problem description

Using Python 2.7 on a Windows machine, I have a large pandas DataFrame (about 7 million rows and 20+ columns) from a SQL query that I'd like to filter by looping through IDs then run calculations on the resulting filtered data. I'd also like to do this in parallel.

I know that if I try to do this with standard methods from the multiprocessing package in Windows, each process will generate a new instance of that large DataFrame for its own use and my memory will be eaten up. So I'm trying to use information I've read on remote managers to make my DataFrame a proxy object and share that across each process but I'm struggling to make it work.
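
For illustration, a minimal sketch of the kind of remote-manager/proxy setup meant here; the SharedData wrapper, its filter_by_id method, and the DataFrameManager class are hypothetical placeholders for the idea, not the code shown further below:

import multiprocessing
from multiprocessing.managers import BaseManager
import pandas

class SharedData(object):
    """Hypothetical wrapper that lives inside the manager process."""
    def __init__(self, df):
        self._df = df

    def filter_by_id(self, index):
        # the filtered slice is pickled back to the caller on every request
        return self._df[self._df['ID'] == index]

class DataFrameManager(BaseManager):
    pass

DataFrameManager.register('SharedData', SharedData)

def calc(shared, index):
    filtered = shared.filter_by_id(index)
    # run calculations on the already-filtered slice here
    return index, len(filtered)

if __name__ == '__main__':
    all_data = pandas.DataFrame({'ID': [1, 1, 2], 'value': [10, 20, 30]})  # toy stand-in
    manager = DataFrameManager()
    manager.start()
    shared = manager.SharedData(all_data)  # workers only ever receive this proxy
    pool = multiprocessing.Pool()
    jobs = [pool.apply_async(calc, (shared, x)) for x in (1, 2)]
    pool.close()
    pool.join()
    print([job.get() for job in jobs])

Note that every filter_by_id call still pickles the filtered slice on its way back from the manager process, so the proxy avoids copying the whole DataFrame into each worker but not the per-call serialisation.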

My code is below, and I can get it to work on a single for loop no problem, but again the memory gets eaten up if I make it a parallel process:

import multiprocessing
import pandas
import pyodbc

def download(args):
    """pydobc code to download data from sql database"""

def calc(dataset, index):
    filter_data = dataset[dataset['ID'] == index]
    """run calculations on filtered DataFrame"""
    """append results to local csv"""

if __name__ == '__main__':
    data_1 = download(args_1)
    data_2 = download(args_2)
    all_data = data_1.append(data_2) #Append downloaded DataFrames into one

    unique_id = pandas.unique(all_data['ID'])
    pool = multiprocessing.Pool()
    # each task pickles all_data again for its worker process
    [pool.apply_async(calc, args=(all_data, x)) for x in unique_id]
    pool.close()
    pool.join()

Recommended answer

Q :" 共享大熊猫DataFrame 具有多处理功能在Python中循环吗?"

Q : "Sharing large pandas DataFrame with multiprocessing for loop in Python ?"

While the multiprocessing module does provide tools for sharing some data between processes, using them here is, for performance reasons, an anti-pattern for the stated goal of running the work inside a Pool instance in a "just"-[CONCURRENT] fashion.

You spend immense costs on moving the filtering into a pool of independent ("just"-[CONCURRENT]) workers, yet each of them then waits to be served by the single central Manager process (itself serialised by its own GIL-lock), which turns the Manager-mediated work back into a pure-[SERIAL] flow. Even worse, the task is RAM-I/O-bound, so the performance suffocation from having no free, local access to the data pushes things principally in the wrong direction.

The speed of burning money (the add-on costs), invisible in a few SLOCs, can be (and often is) far higher than any in-vivo performance benefit, which remains only potential until the code is well engineered, tuned and validated, gained from running several lines of code execution in a "just"-[CONCURRENT] (harder still, a true-[PARALLEL]) fashion.
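
For contrast, a minimal sketch of one commonly suggested alternative (not from the answer above, and assuming the per-ID calculation can run on an already-filtered slice): filter once in the parent via groupby and hand each worker only its own slice, so only small chunks ever cross a process boundary:

import multiprocessing
import pandas

def calc(group):
    # placeholder for the real per-ID calculation; the slice arrives pre-filtered
    return group['ID'].iloc[0], len(group)

if __name__ == '__main__':
    all_data = pandas.DataFrame({'ID': [1, 1, 2, 2, 3], 'value': range(5)})  # toy stand-in
    groups = [group for _, group in all_data.groupby('ID')]  # one small slice per ID
    pool = multiprocessing.Pool()
    results = pool.map(calc, groups)
    pool.close()
    pool.join()
    print(results)

Whether this pays off still depends on how heavy the per-ID calculation is compared to the pickling and process-management overhead, which is exactly the trade-off described above.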
