Fastest way to create a pandas column conditionally


Problem description

In a Pandas DataFrame, I want to create a new column conditionally based on the value of another column. In my application, the DataFrame typically has a few million lines, and the number of unique conditional values is small, on the order of unity. Performance is extremely important: what is the fastest way to generate the new column?

I created an example case below, and already tried and compared different methods. In the example, the conditional filling is represented by a dictionary lookup based on the value of the column label (here: one of 1, 2, 3).

lookup_dict = {
    1: 100,   # arbitrary
    2: 200,   # arbitrary
    3: 300,   # arbitrary
    }

I then expect my DataFrame to be filled as:

   label  output
0      3     300
1      2     200
2      3     300
3      3     300
4      2     200
5      2     200
6      1     100
7      1     100
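For reference, a minimal sketch (my addition, not in the original post) that builds the small example frame shown above; the 'output' column is what each method below fills in:

import pandas as pd

# labels taken from the expected table above
df = pd.DataFrame({'label': [3, 2, 3, 3, 2, 2, 1, 1]})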


Below are 6 different methods, tested on 10 million rows (parameter Nlines in the test code):

  • method 1: pandas.groupby().apply()
  • method 2: pandas.groupby().indices.items()
  • method 3: pandas.Series.map
  • method 4: for loop on labels
  • method 5: numpy.select
  • method 6: numba

The full code is available at the end of this post, with the runtimes of all methods. The output of every method is asserted to be equal before performance is compared.

method 1: pandas.groupby().apply()

I use pandas.groupby() on the label, then fill each block with the same value using apply().

def fill_output(r):
    ''' called by groupby().apply(): all r.label values are the same '''
    r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
    return r

df = df.groupby('label').apply(fill_output)

This gives:

method_1_groupby ran in 2.29s (average over 3 iterations)

Note that groupby().apply() is run twice on the first group to determine which code path to use (see Pandas #2936). This can slow things down for a small number of groups. I tried to trick Method 1 by adding a first dummy group, but I didn't get much improvement.
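For illustration, a minimal sketch of that dummy-group workaround, assuming a label 0 that never occurs in the real data (the names here are my own, not from the original code):

# add a throwaway first group so apply()'s double call hits it
lookup_dict[0] = 0                                    # dummy value for the dummy label
df_padded = pd.concat([pd.DataFrame({'label': [0]}), df], ignore_index=True)
df = df_padded.groupby('label').apply(fill_output)
df = df[df.label != 0].reset_index(drop=True)         # drop the dummy row again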

method 2: pandas.groupby().indices.items()

The second method is a variant: instead of using apply, I access the indices directly with groupby().indices.items(). This ends up being twice as fast as Method 1, and it's the method I've used for a long time.

dgb = df.groupby('label')
for label, idx in dgb.indices.items():
    df.loc[idx, 'output'] = lookup_dict[label]

This gives:

method_2_indices ran in 1.21s (average over 3 iterations)

method 3: pandas.Series.map

I used pandas.Series.map.

df['output'] = df.label.map(lookup_dict.get)

I had very good results in similar cases where the number of looked-up values was comparable with the number of lines. In the present case, map ends up being the slowest method so far (3.07 s against 2.29 s for Method 1).

method_3_map ran in 3.07s (average over 3 iterations)

I attribute that to the small number of looked-up values, but there may just be an issue with the way I implemented it.
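As it turns out (see the recommended answer and method_3b_mapdirect in the edit below), passing the dict itself instead of its .get method lets map take a much faster path:

df['output'] = df.label.map(lookup_dict)   # the dict itself, not lookup_dict.get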

method 4: for loop on labels

The 4th method is quite naive: I just loop over all labels and select the matching part of the DataFrame.

for label, value in lookup_dict.items():
    df.loc[df.label == label, 'output'] = value

Surprisingly, though, I ended up with much faster results than in the previous cases. I expected the groupby-based solutions to be faster than this one, because Pandas has to make three full-column comparisons with df.label == label here. The results prove me wrong:

method_4_forloop ran in 0.54s (average over 3 iterations)

method 5: numpy.select

The fifth method uses the numpy.select function, based on this StackOverflow answer (https://stackoverflow.com/a/19913845/5622825).

conditions = [df.label == k for k in lookup_dict.keys()]
choices = list(lookup_dict.values())

df['output'] = np.select(conditions, choices)

This yields the best results so far:

method_5_select ran in 0.29s (average over 3 iterations)

method 6: numba

Eventually, I tried a numba approach in Method 6.

Just for the sake of the example, the conditional filling values are hardcoded in the compiled function. I don't know how to give Numba a list as a runtime constant:

@jit(int64[:](int64[:]), nopython=True)
def hardcoded_conditional_filling(column):
    output = np.zeros_like(column)
    i = 0
    for c in column:
        if c == 1:
            output[i] = 100
        elif c == 2:
            output[i] = 200
        elif c == 3:
            output[i] = 300
        i += 1
    return output

df['output'] = hardcoded_conditional_filling(df.label.values)

I ended up with the best time, faster than Method 5 by 50%.

method_6_numba ran in 0.19s (average over 3 iterations)

I haven't implemented a more general numba version for the reason stated above: I don't know how to give Numba a list as a runtime constant without a major drop in performance.
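A possible workaround (a sketch of my own, not benchmarked here) is to pass the lookup table as a NumPy array argument instead of a compile-time constant, exploiting the fact that the labels are contiguous integers starting at 1; the function name is hypothetical:

@jit(int64[:](int64[:], int64[:]), nopython=True)
def lookup_table_filling(column, lookup_arr):
    # lookup_arr[k - 1] holds the value for label k (labels assumed 1..N)
    output = np.zeros_like(column)
    for i in range(column.shape[0]):
        output[i] = lookup_arr[column[i] - 1]
    return output

lookup_arr = np.array(list(lookup_dict.values()), dtype=np.int64)
df['output'] = lookup_table_filling(df.label.values, lookup_arr)

The full benchmark code: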

import pandas as pd
import numpy as np
from timeit import timeit
from numba import jit, int64

lookup_dict = {
        1: 100,   # arbitrary
        2: 200,   # arbitrary
        3: 300,   # arbitrary
        }

Nlines = int(1e7)

# Generate random labels (1, 2 or 3)
label = np.round(np.random.rand(Nlines)*2+1).astype(np.int64)
df0 = pd.DataFrame(label, columns=['label'])

# Now the goal is to assign the look_up_dict values to a new column 'output' 
# based on the value of label

# Method 1
# using groupby().apply()

def method_1_groupby(df):

    def fill_output(r):
        ''' called by groupby().apply(): all r.label values are the same '''
        #print(r.iloc[0]['label'])   # activate to reveal the #2936 issue in Pandas
        r.loc[:, 'output'] = lookup_dict[r.iloc[0]['label']]
        return r

    df = df.groupby('label').apply(fill_output)
    return df 

def method_2_indices(df):

    dgb = df.groupby('label')
    for label, idx in dgb.indices.items():
        df.loc[idx, 'output'] = lookup_dict[label]

    return df

def method_3_map(df):

    df['output'] = df.label.map(lookup_dict.get)

    return df

def method_4_forloop(df):
    ''' naive '''

    for label, value in lookup_dict.items():
        df.loc[df.label == label, 'output'] = value

    return df

def method_5_select(df):
    ''' Based on answer from 
    https://stackoverflow.com/a/19913845/5622825
    '''

    conditions = [df.label == k for k in lookup_dict.keys()]
    choices = list(lookup_dict.values())

    df['output'] = np.select(conditions, choices)

    return df

def method_6_numba(df):
    ''' This works, but it is hardcoded and i don't really know how
    to make it compile with list as runtime constants'''


    @jit(int64[:](int64[:]), nopython=True)
    def hardcoded_conditional_filling(column):
        output = np.zeros_like(column)
        i = 0
        for c in column:
            if c == 1:
                output[i] = 100
            elif c == 2:
                output[i] = 200
            elif c == 3:
                output[i] = 300
            i += 1
        return output

    df['output'] = hardcoded_conditional_filling(df.label.values)

    return df

df1 = method_1_groupby(df0.copy())
df2 = method_2_indices(df0.copy())
df3 = method_3_map(df0.copy())
df4 = method_4_forloop(df0.copy())
df5 = method_5_select(df0.copy())
df6 = method_6_numba(df0.copy())

# make sure we haven't modified the input (would bias the results)
assert 'output' not in df0.columns 

# Test validity
assert (df1 == df2).all().all()
assert (df1 == df3).all().all()
assert (df1 == df4).all().all()
assert (df1 == df5).all().all()
assert (df1 == df6).all().all()

# Compare performances
Nites = 3
print('Compare performances for {0:.1g} lines'.format(Nlines))
print('-'*30)
for method in [
               'method_1_groupby', 'method_2_indices', 
               'method_3_map', 'method_4_forloop', 
               'method_5_select', 'method_6_numba']:
    print('{0} ran in {1:.2f}s (average over {2} iterations)'.format(
            method, 
            timeit("{0}(df)".format(method), setup="from __main__ import df0, {0}; df=df0.copy()".format(method), number=Nites)/Nites,
            Nites))

Output:

Compare performances for 1e+07 lines
------------------------------
method_1_groupby ran in 2.29s (average over 3 iterations)
method_2_indices ran in 1.21s (average over 3 iterations)
method_3_map ran in 3.07s (average over 3 iterations)
method_4_forloop ran in 0.54s (average over 3 iterations)
method_5_select ran in 0.29s (average over 3 iterations)
method_6_numba ran in 0.19s (average over 3 iterations)


I'd be interested in any other solution that could yield better performance. I was originally looking for Pandas-based methods, but I accept numba/cython-based solutions too.

Edit: adding Chrisb's methods for comparison:

def method_3b_mapdirect(df):
    ''' Suggested by https://stackoverflow.com/a/51388828/5622825'''

    df['output'] = df.label.map(lookup_dict)

    return df

def method_7_take(df):
    ''' Based on answer from 
    https://stackoverflow.com/a/19913845/5622825

    Exploiting that labels are continuous integers
    '''

    lookup_arr = np.array(list(lookup_dict.values()))
    df['output'] = lookup_arr.take(df['label'] - 1)

    return df

With the following runtimes:

method_3b_mapdirect ran in 0.23s (average over 3 iterations)
method_7_take ran in 0.11s (average over 3 iterations)

This makes #3 faster than any other method (#6 aside), and the most elegant too. Use #7 if your use case is compatible.

Recommended answer

I'd consider .map (#3) the idiomatic way to do this - but don't pass the .get - use the dictionary by itself, and you should see a pretty significant improvement. (With a dict, map can use a vectorized lookup instead of calling a Python function once per row.)

df = pd.DataFrame({'label': np.random.randint(1, 4, size=1000000, dtype='i8')})

%timeit df['output'] = df.label.map(lookup_dict.get)
261 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df['output'] = df.label.map(lookup_dict)
69.6 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

If the number of conditions is small, and the comparison cheap (i.e. ints and your lookup table), direct comparison of the values (4 and especially 5) is faster than .map, but this wouldn't always be true, e.g. if you had a set of strings.

If your lookup labels really are contiguous integers, you can exploit this and look up using a take, which should be about as fast as numba. I think this is basically as fast as this can go - you could write the equivalent in Cython, but it won't be quicker.

%%timeit
lookup_arr = np.array(list(lookup_dict.values()))
df['output'] = lookup_arr.take(df['label'] - 1)
8.68 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
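If the labels were not contiguous, a variant of this idea (my own sketch, assuming every label in the frame appears in lookup_dict) could sort the keys and recover positions with np.searchsorted before indexing:

keys = np.array(sorted(lookup_dict))                  # sorted label values
values = np.array([lookup_dict[k] for k in keys])     # matching outputs
positions = np.searchsorted(keys, df['label'].values) # exact hit for each label
df['output'] = values[positions]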
