How to avoid excessive lambda functions in pandas DataFrame assign and apply method chains

Question

I am trying to translate a pipeline of manipulations on a dataframe in R into its Python equivalent. A basic example of the pipeline is as follows, incorporating a few mutate and filter calls:

library(tidyverse)

calc_circle_area <- function(diam) pi / 4 * diam^2
calc_cylinder_vol <- function(area, length) area * length

raw_data <- tibble(cylinder_name=c('a', 'b', 'c'), length=c(3, 5, 9), diam=c(1, 2, 4))

new_table <- raw_data %>% 
  mutate(area = calc_circle_area(diam)) %>% 
  mutate(vol = calc_cylinder_vol(area, length)) %>% 
  mutate(is_small_vol = vol < 100) %>% 
  filter(is_small_vol)

I can replicate this in pandas without too much trouble, but it involves nested lambda calls when using assign to do an apply (an outer lambda that takes the dataframe as its argument, and an inner lambda that takes each dataframe row as its argument). This tends to obscure the meaning of the assign call, where I would like to specify something more to the point (like the R version) if at all possible.

import pandas as pd
import math

calc_circle_area = lambda diam: math.pi / 4 * diam**2
calc_cylinder_vol = lambda area, length: area * length

raw_data = pd.DataFrame({'cylinder_name': ['a', 'b', 'c'], 'length': [3, 5, 9], 'diam': [1, 2, 4]})

new_table = (
    raw_data
        .assign(area=lambda df: df.diam.apply(calc_circle_area))
        .assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))
        .assign(is_small_vol=lambda df: df.vol < 100)
        .loc[lambda df: df.is_small_vol]
)

I am aware that .assign(area=lambda df: df.diam.apply(calc_circle_area)) could be written as .assign(area=raw_data.diam.apply(calc_circle_area)), but only because the diam column already exists in the original dataframe, which may not always be the case.

I also realize that the calc_... functions here are vectorizable, meaning I could also do things like

.assign(area=lambda df: calc_circle_area(df.diam))
.assign(vol=lambda df: calc_cylinder_vol(df.area, df.length))

but again, since most functions aren't vectorizable, this wouldn't work in most cases.

TL;DR I am wondering whether there is a cleaner way to "mutate" columns on a dataframe that doesn't involve double-nesting lambda statements, as in:

.assign(vol=lambda df: df.apply(lambda r: calc_cylinder_vol(r.area, r.length), axis=1))

Are there best practices for this type of application, or is this the best one can do within the context of method chaining?

Recommended answer

The best practice is to vectorize operations.

The reason for this is performance: apply with axis=1 calls a Python function once per row, which is very slow compared with operating on whole columns. You are already taking advantage of vectorization in the R code, and you should continue to do so in Python. You will find that, because of this performance consideration, most of the functions you need actually are vectorizable.
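As a minimal sketch of the vectorized form, assuming the helper functions use only arithmetic that works on a whole Series, the question's pipeline can be written with single-level lambdas. The def versions of the helpers below are a rewrite of the question's lambdas for readability, not a requirement.

```python
import math

import pandas as pd


def calc_circle_area(diam):
    # Plain arithmetic, so this accepts a scalar or an entire Series.
    return math.pi / 4 * diam**2


def calc_cylinder_vol(area, length):
    return area * length


raw_data = pd.DataFrame(
    {"cylinder_name": ["a", "b", "c"], "length": [3, 5, 9], "diam": [1, 2, 4]}
)

new_table = (
    raw_data
    # Each lambda receives the intermediate dataframe and passes whole columns.
    .assign(area=lambda df: calc_circle_area(df["diam"]))
    .assign(vol=lambda df: calc_cylinder_vol(df["area"], df["length"]))
    .assign(is_small_vol=lambda df: df["vol"] < 100)
    .loc[lambda df: df["is_small_vol"]]
)
```

Only the outer lambda over the dataframe remains; no per-row apply is involved.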

That will get rid of your inner lambdas. For the outer lambdas over the df, I think what you have is the cleanest pattern. The alternative is to repeatedly reassign to the raw_data variable, or to some other intermediate variable(s), but this doesn't fit the method chaining style you are asking for.
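For comparison, the intermediate-variable style might look roughly like the sketch below. The variable name table is an assumption made for illustration; it works on a copy so that raw_data is left untouched.

```python
import math

import pandas as pd

raw_data = pd.DataFrame(
    {"cylinder_name": ["a", "b", "c"], "length": [3, 5, 9], "diam": [1, 2, 4]}
)

# Work on a copy so the original dataframe is not modified.
table = raw_data.copy()

# Each step is an ordinary statement, so no lambdas are needed,
# but the pipeline is no longer a single chained expression.
table["area"] = math.pi / 4 * table["diam"] ** 2
table["vol"] = table["area"] * table["length"]
table["is_small_vol"] = table["vol"] < 100
new_table = table[table["is_small_vol"]]
```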

There are also Python packages like dfply that aim to mimic the dplyr feel in Python. These do not receive the same level of support as core pandas, so keep that in mind if you want to go this route.

Or, if you just want to save a bit of typing, and all the functions operate only on columns, you can create a glue function that unpacks the columns for you and passes them along.

def df_apply(col_fn, *col_names):
    def inner_fn(df):
        cols = [df[col] for col in col_names]
        return col_fn(*cols)
    return inner_fn

Then usage ends up looking something like this:

new_table = (
    raw_data
        .assign(area=df_apply(calc_circle_area, 'diam'))
        .assign(vol=df_apply(calc_cylinder_vol, 'area', 'length'))
        .assign(is_small_vol=lambda df: df.vol < 100)
        .loc[lambda df: df.is_small_vol]
)

It is also possible to write this without relying on vectorization, in case a function that only accepts scalar values does come up.

def df_apply_unvec(fn, *col_names):
    def inner_fn(df):
        def row_fn(row):
            vals = [row[col] for col in col_names]
            return fn(*vals)
        return df.apply(row_fn, axis=1)
    return inner_fn
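A usage sketch for the row-wise helper follows. The function shape_label and its threshold are hypothetical, chosen only because a plain if statement is a typical example of code that cannot take a whole Series.

```python
import pandas as pd


def df_apply_unvec(fn, *col_names):
    # Same helper as above, repeated so this example runs on its own.
    def inner_fn(df):
        def row_fn(row):
            vals = [row[col] for col in col_names]
            return fn(*vals)
        return df.apply(row_fn, axis=1)
    return inner_fn


def shape_label(length, diam):
    # Hypothetical scalar-only function: `if` on a whole Series would raise.
    if length >= 3 * diam:
        return "slender"
    return "stubby"


raw_data = pd.DataFrame(
    {"cylinder_name": ["a", "b", "c"], "length": [3, 5, 9], "diam": [1, 2, 4]}
)

# The assign call names the function and its input columns, with no lambdas.
labelled = raw_data.assign(shape=df_apply_unvec(shape_label, "length", "diam"))
```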

I used named functions for extra clarity, but this can be condensed with lambdas into something that looks much like your original format, just generic.
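One possible condensed form is sketched below. It is meant to behave the same as the named-function version of df_apply_unvec; whether it is actually more readable is a matter of taste.

```python
import math

import pandas as pd

# Nested lambdas: take a function and column names, return a callable for assign.
df_apply_unvec = lambda fn, *cols: lambda df: df.apply(
    lambda row: fn(*(row[c] for c in cols)), axis=1
)

calc_circle_area = lambda diam: math.pi / 4 * diam**2
calc_cylinder_vol = lambda area, length: area * length

raw_data = pd.DataFrame(
    {"cylinder_name": ["a", "b", "c"], "length": [3, 5, 9], "diam": [1, 2, 4]}
)

new_table = (
    raw_data
    .assign(area=df_apply_unvec(calc_circle_area, "diam"))
    .assign(vol=df_apply_unvec(calc_cylinder_vol, "area", "length"))
    .assign(is_small_vol=lambda df: df["vol"] < 100)
    .loc[lambda df: df["is_small_vol"]]
)
```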
