Pandas - Explanation on apply function being slow


Problem Description


The apply function seems to run very slowly on a large dataframe (about 1~3 million rows).

I have checked related questions here, like Speed up Pandas apply function and Counting within pandas apply() function; it seems the best way to speed it up is not to use the apply function :)

For my case, I have two kinds of tasks to do with the apply function.

First: apply with lookup dict query

def f(p_id, p_dict):
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = DataFrame(...)  # another DataFrame that works like a lookup table
df = df.apply(f, args=(p_dict,))

Second: apply with groupby

def f(week_id, min_week_num, p_dict):
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].iloc[:, 2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = map(f_partial, df['WEEK'])

I guess the first case could be done with a dataframe join, but I am not sure about the resource cost of such a join on a large dataset.

My questions are:

  1. Is there any way to substitute apply in the two cases above?
  2. Why is apply so slow? For the dict lookup case, I think it should be O(N); it shouldn't cost that much even if N is 1 million.

Solution

Concerning your first question, I can't say exactly why this instance is slow. But generally, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame you have considerable IO overhead (I can't guarantee this is the case 100% of the time, since Pandas has loads of internal implementation optimizations).
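As a rough illustration of the vectorization point (a minimal sketch, not part of the original answer), compare the same arithmetic done as a vectorized column operation versus routed through apply; the vectorized form is typically orders of magnitude faster:

import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': np.random.randint(0, 100, 1000000)})

# Vectorized: a single call into optimized native code over the whole column
%timeit df_demo['x'] * 2

# apply: a Python-level function call for every element
%timeit df_demo['x'].apply(lambda v: v * 2)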

For your first method, I assume you are trying to fill a 'value' column in df using the p_dict as a lookup table. It is about 1000x faster to use pd.merge:

import string

import numpy as np
import pandas as pd

##
# Part 1 - filling a column by a lookup table
##
# Row-by-row lookup: scans p_dict once per element (this is what makes apply slow here)
def f1(col, p_dict):
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]

# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})

# Apply the f1 method  as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop

# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)

>>> 1000 loops, best of 5: 826 µs per loop
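One caveat (my note, not part of the original answer): an inner merge drops rows whose p_id has no match in p_dict and does not guarantee the original row order. If either matters, a Series-based lookup with map preserves both, leaving unmatched IDs as NaN:

# Build a lookup Series indexed by ID, then map it over p_id.
# Row order is preserved; unmatched IDs become NaN instead of being dropped.
lookup = p_dict.set_index('ID')['value']
df['value'] = df['p_id'].map(lookup)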

Concerning the second task, we can quickly add a new column to p_dict that calculates a mean where the time window starts at min_week_num and ends at the week for that row in p_dict. This requires that p_dict is sorted in ascending order along the WEEK column. Then you can use pd.merge again.

I am assuming that min_week_num is 0 in the following example, but you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method runs in O(n) since it performs a fixed number of operations per iteration.

n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})

def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking mean grows each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out

# Pass the underlying array so positional indexing inside the helper is unambiguous
p_dict['Means'] = rolling_growing_mean(p_dict['value'].values)

df_merged = pd.merge(df, p_dict, how='inner', left_on='WEEK', right_on='WEEK')
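As a side note (my addition, assuming a reasonably recent pandas version), the hand-rolled rolling_growing_mean computes exactly the expanding mean that pandas provides as a built-in, so the 'Means' column can also be written as:

# Expanding mean: at row i, the mean of value[0..i], same result as rolling_growing_mean
p_dict['Means'] = p_dict['value'].expanding().mean()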
