Pandas DataFrame apply function to multiple columns and output multiple columns

Question

I have been scouring SO for the best way of applying a function that takes multiple separate Pandas DataFrame columns and outputs multiple new columns in the same said DataFrame. Let's say I have the following:

def apply_func_to_df(df):
    df[['new_A', 'new_B']] = df.apply(lambda x: transform_func(x['A'], x['B'], x['C']), axis=1)

def transform_func(value_A, value_B, value_C):
    # do some processing and transformation and stuff
    return new_value_A, new_value_B

I am trying to apply this function, as shown above, to the whole DataFrame df in order to output 2 NEW columns. However, this can generalize to a use case/function that takes in n DataFrame columns and outputs m new columns to the same DataFrame.

The following are things I have been looking at (with varying degrees of success):

  • Create a Pandas Series for the function call, then append to the existing DataFrame,
  • Zip the output columns (but there are some issues that happen in my current implementation)
  • Re-write the basic function transform_func to explicitly expect rows (i.e. fields) A, B, C as follows, then do an apply to the df:
def transform_func_mod(df_row):
    # do something with df_row['A'], df_row['B'], df_row['C']
    return new_value_A, new_value_B

I would like a very general and Pythonic way to accomplish this task, while taking performance into account (both memory- and time-wise). I would appreciate any input on this, as I have been struggling with this due to my unfamiliarity with Pandas.

Recommended answer

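Write your transform_func the following way:

  • it should take one parameter: the current row,
  • the function can read individual columns from the current row and make any use of them,
  • the returned object should be a Series with:
    • values: whatever you want to return,
    • index: the target column names.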

    Example: Assuming that all 3 columns are of string type, concatenate columns A and B, and append the suffix "_xx" to C:

    import pandas as pd

    def transform_func(row):
        a, b, c = row.A, row.B, row.C
        # The Series index supplies the names of the new columns.
        return pd.Series([a + b, c + '_xx'], index=['new_A', 'new_B'])
    

    To get only the new values, apply this function to each row:

    df.apply(transform_func, axis=1)
    

    Note that the resulting DataFrame retains keys of the original rows (we will make use of this feature in a moment).
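As a small illustration of the point about row keys, the sketch below runs the row-based function on a made-up DataFrame. The sample values and the index labels 10 and 20 are assumptions chosen only to make the retained index visible; they are not from the original post.

```python
import pandas as pd

# Hypothetical sample data with a non-default index, so that it is easy
# to see the index labels carried over into the result.
df = pd.DataFrame(
    {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]},
    index=[10, 20],
)

def transform_func(row):
    # The Series index becomes the column names of the result.
    return pd.Series([row.A + row.B, row.C + "_xx"], index=["new_A", "new_B"])

# One output row per input row; the original frame is left unchanged.
new_cols = df.apply(transform_func, axis=1)
print(new_cols)
```

The result holds only the two new columns, labelled with the same index as df.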

    Or if you want to add these new columns to your DataFrame, join your df with the result of the above application, saving the join result under the original df:

    df = df.join(df.apply(transform_func, axis=1))
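A minimal end-to-end sketch of the join step, again on made-up data (the string index labels "x" and "y" are hypothetical). The join matches rows on the index, which is why it matters that the applied result keeps the original row keys.

```python
import pandas as pd

# Hypothetical sample data; the index labels are arbitrary.
df = pd.DataFrame(
    {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]},
    index=["x", "y"],
)

def transform_func(row):
    return pd.Series([row.A + row.B, row.C + "_xx"], index=["new_A", "new_B"])

# join aligns the two frames on their shared index labels.
df = df.join(df.apply(transform_func, axis=1))
print(df)
```

Running the join a second time on the same df would raise a ValueError, because new_A and new_B would then exist on both sides and no suffix is given.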
    

    Edit following the comment as of 03:36:34Z

    Using zip is probably the slowest option. A row-based function should be quicker, and it is a more intuitive construction. Probably the quickest way is to write 2 vectorized expressions, one for each new column. In this case something like:

    df['new_A'] = df.A + df.B
    df['new_B'] = df.C + '_xx'
    

    In general, though, the question is whether a row-based function can be expressed as vectorized expressions (as I did above). If it cannot, you can fall back to applying a row-based function.
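For the fallback case, one possible sketch keeps the question's original signature (a function of separate values that returns a tuple) and lets apply spread the tuple into columns with result_type="expand". The numeric data and the branching logic are hypothetical, a stand-in for processing that is awkward to vectorize; this is an alternative to returning a Series, not the method used in the answer above.

```python
import pandas as pd

# Hypothetical numeric sample data.
df = pd.DataFrame({"A": [1, 5], "B": [4, 2], "C": [10, 20]})

def transform_func(value_a, value_b, value_c):
    # Hypothetical per-row logic with a branch.
    if value_a > value_b:
        return value_a - value_b, value_c * 2
    return value_b - value_a, value_c

# result_type="expand" turns each returned tuple into one row of a
# two-column DataFrame, which is then assigned to the two new columns.
df[["new_A", "new_B"]] = df.apply(
    lambda row: transform_func(row["A"], row["B"], row["C"]),
    axis=1,
    result_type="expand",
)
print(df)
```

This still calls Python code once per row, so it is a convenience rather than a speed-up.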

    To compare how fast each solution is, use %timeit.
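%timeit is an IPython/Jupyter magic; in a plain script the standard-library timeit module does a similar job. The sketch below times the three approaches discussed here on a made-up 1000-row frame. The helper names row_based, zipped and vectorized, the frame size, and the repeat count are all assumptions for illustration, and the relative ordering you see will depend on your data and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical benchmark data; the size is arbitrary.
df = pd.DataFrame({"A": ["a"] * 1000, "B": ["b"] * 1000, "C": ["c"] * 1000})

def row_based(frame):
    # One pd.Series built per row via apply.
    return frame.apply(
        lambda row: pd.Series([row.A + row.B, row.C + "_xx"], index=["new_A", "new_B"]),
        axis=1,
    )

def zipped(frame):
    # Plain Python loop over the zipped columns.
    pairs = [(a + b, c + "_xx") for a, b, c in zip(frame["A"], frame["B"], frame["C"])]
    return pd.DataFrame(pairs, columns=["new_A", "new_B"], index=frame.index)

def vectorized(frame):
    # Whole-column expressions.
    return pd.DataFrame({"new_A": frame["A"] + frame["B"], "new_B": frame["C"] + "_xx"})

timings = {}
for func in (row_based, zipped, vectorized):
    # Total wall-clock time for 3 calls of each approach.
    timings[func.__name__] = timeit.timeit(lambda: func(df), number=3)
    print(f"{func.__name__}: {timings[func.__name__]:.4f} s for 3 runs")
```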
