Sharing large pandas DataFrame with multiprocessing for loop in Python

Problem description

Using Python 2.7 on a Windows machine, I have a large pandas DataFrame (about 7 million rows and 20+ columns) from a SQL query that I'd like to filter by looping through IDs then run calculations on the resulting filtered data. I'd also like to do this in parallel.

I know that if I try to do this with standard methods from the multiprocessing package in Windows, each process will generate a new instance of that large DataFrame for its own use and my memory will be eaten up. So I'm trying to use information I've read on remote managers to make my DataFrame a proxy object and share that across each process but I'm struggling to make it work.
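
For illustration, a minimal sketch of the kind of remote-manager/proxy setup meant here; the SharedData wrapper, its filter_by_id method, and the DataFrameManager class are hypothetical placeholders for the idea, not the code shown further below:

import multiprocessing
from multiprocessing.managers import BaseManager
import pandas

class SharedData(object):
    """Hypothetical wrapper that lives inside the manager process."""
    def __init__(self, df):
        self._df = df

    def filter_by_id(self, index):
        # the filtered slice is pickled back to the caller on every request
        return self._df[self._df['ID'] == index]

class DataFrameManager(BaseManager):
    pass

DataFrameManager.register('SharedData', SharedData)

def calc(shared, index):
    filtered = shared.filter_by_id(index)
    # run calculations on the already-filtered slice here
    return index, len(filtered)

if __name__ == '__main__':
    all_data = pandas.DataFrame({'ID': [1, 1, 2], 'value': [10, 20, 30]})  # toy stand-in
    manager = DataFrameManager()
    manager.start()
    shared = manager.SharedData(all_data)  # workers only ever receive this proxy
    pool = multiprocessing.Pool()
    jobs = [pool.apply_async(calc, (shared, x)) for x in (1, 2)]
    pool.close()
    pool.join()
    print([job.get() for job in jobs])

Note that every filter_by_id call still pickles the filtered slice on its way back from the manager process, so the proxy avoids copying the whole DataFrame into each worker but not the per-call serialisation.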

My code is below, and I can get it to work on a single for loop no problem, but again the memory gets eaten up if I make it a parallel process:

import multiprocessing
import pandas
import pyodbc

def download(args):
    """pydobc code to download data from sql database"""

def calc(dataset, index):
    filter_data = dataset[dataset['ID'] == index]
    """run calculations on filtered DataFrame"""
    """append results to local csv"""

if __name__ == '__main__':
    data_1 = download(args_1)
    data_2 = download(args_2)
    all_data = data_1.append(data_2) #Append downloaded DataFrames into one

    unique_id = pandas.unique(all_data['ID'])
    pool = multiprocessing.Pool()
    # each task pickles all_data again for its worker process
    [pool.apply_async(calc, args=(all_data, x)) for x in unique_id]
    pool.close()
    pool.join()

Recommended answer

Q :" 共享大熊猫DataFrame 具有多处理功能在Python中循环吗?"

Q : "Sharing large pandas DataFrame with multiprocessing for loop in Python ?"

While the multiprocessing module does provide tools for sharing some data between processes, using them here is, for performance reasons, an anti-pattern for the stated goal of running the work inside a Pool instance in a "just"-[CONCURRENT] fashion.

You spend immense costs on moving the filtering into a pool of independent ("just"-[CONCURRENT]) workers, yet each of them then waits to be served by the single central Manager process (itself serialised by its own GIL-lock), which turns the Manager-mediated work back into a pure-[SERIAL] flow. Even worse, the task is RAM-I/O-bound, so the performance suffocation from having no free, local access to the data pushes things principally in the wrong direction.

The speed of burning money (the add-on costs), invisible in a few SLOCs, can be (and often is) far higher than any in-vivo performance benefit, which remains only potential until the code is well engineered, tuned and validated, gained from running several lines of code execution in a "just"-[CONCURRENT] (harder still, a true-[PARALLEL]) fashion.
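
For contrast, a minimal sketch of one commonly suggested alternative (not from the answer above, and assuming the per-ID calculation can run on an already-filtered slice): filter once in the parent via groupby and hand each worker only its own slice, so only small chunks ever cross a process boundary:

import multiprocessing
import pandas

def calc(group):
    # placeholder for the real per-ID calculation; the slice arrives pre-filtered
    return group['ID'].iloc[0], len(group)

if __name__ == '__main__':
    all_data = pandas.DataFrame({'ID': [1, 1, 2, 2, 3], 'value': range(5)})  # toy stand-in
    groups = [group for _, group in all_data.groupby('ID')]  # one small slice per ID
    pool = multiprocessing.Pool()
    results = pool.map(calc, groups)
    pool.close()
    pool.join()
    print(results)

Whether this pays off still depends on how heavy the per-ID calculation is compared to the pickling and process-management overhead, which is exactly the trade-off described above.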
