Compute on pandas dataframe concurrently

Question

Is it feasible to run multiple group-wise calculations on a pandas dataframe concurrently and get the results back? I'd like to compute the following dataframes, collect the results one by one, and finally merge them into one dataframe.

df_a = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["height"]))
df_b = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["weight"]))
df_c = df.groupby(["state", "person"]).apply(lambda x: x["number"].sum())

Then,

df_final = merge(df_a, df_b) # omitting the irrelevant part

However, as far as I know, the functionality in multiprocessing doesn't fit my needs here. It looks like it is meant either for concurrently running multiple functions that don't return their internally created local variables and instead just print some output (e.g. the oft-used is_prime function), or for concurrently running a single function with different sets of arguments (e.g. the map function in multiprocessing). I'm not sure I understand it correctly, so correct me if I'm wrong!

What I'd like to do is simply run those three computations (and actually, more) simultaneously and merge the results once all of them have completed successfully. I imagine something like what Go offers with goroutines and channels: define each function separately, launch them one by one so that they run concurrently, wait for all of them to complete, and finally merge the results.

So how can this be written in Python? I read the documentation for multiprocessing, threading, and concurrent.futures, but all of them are too elusive for me; I can't even tell whether I can use those libraries for this to begin with...

(I simplified the code for the sake of brevity; the actual code is more complicated, so please don't answer "Yeah, you can write it in one line in a non-concurrent way" or something like that.)

Thanks.

Recommended answer

Nine months later, this is still one of the top results for working with multiprocessing and pandas. I hope you've found some kind of answer by now, but if not, I've got one that seems to work, and hopefully it will help others who view this question.

import pandas as pd
import numpy as np
#sample data
df = pd.DataFrame([[1,2,3,1,2,3,1,2,3,1],[2,2,2,2,2,2,2,2,2,2],[1,3,5,7,9,2,4,6,8,0],[2,4,6,8,0,1,3,5,7,9]]).transpose()
df.columns=['a','b','c','d']
df

   a  b  c  d
0  1  2  1  2
1  2  2  3  4
2  3  2  5  6
3  1  2  7  8
4  2  2  9  0
5  3  2  2  1
6  1  2  4  3
7  2  2  6  5
8  3  2  8  7
9  1  2  0  9


#this one function computes the three statistics from your question; you could add more, or use different ones for different groupbys
#each x is a (key, group) tuple from iterating over the groupby, so x[1] is the group's dataframe
def f(x):
    return [np.mean(x[1]['c']),np.mean(x[1]['d']),x[1]['d'].sum()]

#sets up a pool of 4 worker processes
from multiprocessing import Pool
pool = Pool(4)

#runs the statistics you wanted on each group
group_df = pd.DataFrame(pool.map(f,df.groupby(['a','b'])))
group_df
   0         1   2
0  3  5.500000  22
1  6  3.000000   9
2  5  4.666667  14

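#labels each row with its group key, iterating in the same order that pool.map consumed the groups
group_df['keys']=[key for key, _ in df.groupby(['a','b'])]

group_df
   0         1   2    keys
0  3  5.500000  22  (1, 2)
1  6  3.000000   9  (2, 2)
2  5  4.666667  14  (3, 2)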
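
For anyone who wants to run this as a standalone script, here is a minimal sketch of the same idea. It differs from the snippet above in a few ways that are my own choices, not part of the original answer: the worker (renamed `summarize`) returns the group key together with its statistics so the labels cannot drift out of step with the rows, the column names `mean_c`, `mean_d` and `sum_d` are made up for readability, and the pool is created under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter.

```python
from multiprocessing import Pool

import pandas as pd


def summarize(item):
    """Compute the three statistics for one (key, group) pair from a groupby."""
    key, group = item  # key is the (a, b) tuple, group is that group's dataframe
    return {
        "a": key[0],
        "b": key[1],
        "mean_c": group["c"].mean(),
        "mean_d": group["d"].mean(),
        "sum_d": group["d"].sum(),
    }


def build_sample():
    """Same sample data as in the answer above."""
    return pd.DataFrame({
        "a": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
        "b": [2] * 10,
        "c": [1, 3, 5, 7, 9, 2, 4, 6, 8, 0],
        "d": [2, 4, 6, 8, 0, 1, 3, 5, 7, 9],
    })


def summarize_groups(df, processes=2):
    """Send each group to a worker process and collect one row per group."""
    with Pool(processes) as pool:
        rows = pool.map(summarize, list(df.groupby(["a", "b"])))
    return pd.DataFrame(rows)


if __name__ == "__main__":
    print(summarize_groups(build_sample()))
```

Each group is pickled and sent to a worker, so for small groups the overhead may well outweigh the gain; it is worth timing against the plain groupby before committing to this.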

At the very least, I hope this helps someone who looks at this in the future.
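
The question also asked for something closer to the Go pattern: start several independent computations, wait for all of them, then merge. A rough sketch of that with concurrent.futures follows; this is a different technique from the pool.map approach above, not the original answerer's method. The sample data, the helper names (`mean_height`, `mean_weight`, `total_number`, `run_all`, `make_sample`) and the result column names are all hypothetical. Each submitted function returns its result through a future, which addresses the worry that worker functions can only print.

```python
import concurrent.futures as cf

import pandas as pd

KEYS = ["state", "person"]


def mean_height(df):
    return df.groupby(KEYS)["height"].mean().rename("mean_height")


def mean_weight(df):
    return df.groupby(KEYS)["weight"].mean().rename("mean_weight")


def total_number(df):
    return df.groupby(KEYS)["number"].sum().rename("total_number")


TASKS = [mean_height, mean_weight, total_number]


def run_all(df, tasks, executor):
    """Submit every task, wait for each result, and join them on the group keys."""
    futures = [executor.submit(task, df) for task in tasks]
    # result() blocks until that task has finished and re-raises any exception it hit
    return pd.concat([future.result() for future in futures], axis=1)


def make_sample():
    """Hypothetical data with the column names used in the question."""
    return pd.DataFrame({
        "state": ["NY", "NY", "NY", "CA", "CA"],
        "person": ["ann", "ann", "bob", "cy", "cy"],
        "height": [160.0, 162.0, 175.0, 180.0, 182.0],
        "weight": [55.0, 57.0, 70.0, 80.0, 84.0],
        "number": [1, 2, 3, 4, 5],
    })


if __name__ == "__main__":
    with cf.ProcessPoolExecutor(max_workers=3) as executor:
        df_final = run_all(make_sample(), TASKS, executor)
    print(df_final)
```

Note that the whole dataframe is pickled and copied to a worker for every task, so this is likely to pay off only when each computation is expensive relative to the size of the data. `run_all` accepts any executor, so a `cf.ThreadPoolExecutor` can be swapped in to avoid the copying, at the cost of contending for the GIL.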
