How is a Pandas DataFrame handled across multiple custom functions when passed as an argument?

Question

We have a project with multiple *.py scripts whose functions receive pandas DataFrame variables as arguments and return them.

This makes me wonder: how do DataFrame variables behave in memory when they are passed as arguments to, or returned from, those functions?

Does modifying the df variable inside a function also alter the parent/main/global variable?

Consider the following example:

import pandas as pd

def add_Col(df): 
   df["New Column"] = 10 * 3

def mod_col(df):
   df["Existing Column"] = df["Existing Column"] ** 2

data = [0,1,2,3]
df = pd.DataFrame(data,columns=["Existing Column"])

add_Col(df)
mod_col(df)

df

When df is displayed at the end, will the new column show up? What about the change made to "Existing Column" when calling mod_col? Did invoking add_Col create a copy of df, or only a pointer to it?

What is the best practice for passing DataFrames into functions? If they are large enough, I assume creating copies will have both performance and memory implications.

Recommended answer

It depends. DataFrames are mutable objects, so, like lists, they can be modified within a function without the function needing to return them.
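Applied to the example from the question, this means both changes are visible to the caller. The sketch below is a minimal check of that claim; the function names are spelled consistently here (add_col, mod_col) as an assumption for illustration, and nothing is copied when the DataFrame is passed in.

```python
import pandas as pd

def add_col(df):
    # Column assignment mutates the very object the caller passed in.
    df["New Column"] = 10 * 3

def mod_col(df):
    # The right-hand side builds a new Series, but assigning it back
    # into the column still changes the caller's DataFrame.
    df["Existing Column"] = df["Existing Column"] ** 2

df = pd.DataFrame([0, 1, 2, 3], columns=["Existing Column"])
add_col(df)
mod_col(df)

print(list(df.columns))                # ['Existing Column', 'New Column']
print(df["Existing Column"].tolist())  # [0, 1, 4, 9]
```

Neither function returns anything, yet the DataFrame in the calling scope ends up with the new column and the squared values.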

On the other hand, the vast majority of pandas operations return a new object, so they do not change the underlying DataFrame. For instance, the examples below show that assigning values with .loc modifies the original, whereas multiplying the entire DataFrame returns a new object and leaves the original unchanged.

If a function combines both kinds of changes, it modifies your original DataFrame only up to the point where an operation returns a new object.
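A small sketch of such a mixed function may make the cut-off point easier to see. The function name mixed_changes and the column "a" are hypothetical, chosen only for illustration.

```python
import pandas as pd

def mixed_changes(df):
    df.loc[0, "a"] = 100  # in place: the caller's DataFrame sees this
    df = df * 2           # new object: the local name df is rebound
    df.loc[1, "a"] = -1   # affects only the new object
    return df

original = pd.DataFrame({"a": [1, 2, 3]})
result = mixed_changes(original)

print(original["a"].tolist())  # [100, 2, 3]
print(result["a"].tolist())    # [200, -1, 6]
```

Only the first assignment reaches the original; everything after the multiplication operates on a different object.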

Modifies the original

df = pd.DataFrame([1,2,4])

def mutate_data(df):
    df.loc[1,0] = 7

mutate_data(df)
print(df)
#   0
#0  1
#1  7
#2  4


Does not modify the original

df = pd.DataFrame([1,2,4])

def mutate_data(df):
    df = df*2

mutate_data(df)
print(df)
#   0
#0  1
#1  2
#2  4
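The reason the second version has no visible effect is that df = df*2 only rebinds the local name df to a new object, which is discarded when the function returns. A minimal sketch of the usual fix, assuming you actually want the doubled values, is to return the new object and capture it (double_data is a hypothetical name):

```python
import pandas as pd

def double_data(df):
    # df * 2 builds a new DataFrame; return it so the caller can use it.
    return df * 2

df = pd.DataFrame([1, 2, 4])
doubled = double_data(df)

print(doubled is df)        # False
print(df[0].tolist())       # [1, 2, 4]
print(doubled[0].tolist())  # [2, 4, 8]
```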


What should you do?

If the purpose of a function is to modify a DataFrame, as in a pipeline, then write a function that takes a DataFrame and returns a DataFrame.

def add_column(df):
    df['new_column'] = 7
    return df


df = add_column(df)
#┃              ┃
#┗ on lhs & rhs ┛

In this scenario it doesn't matter whether the function changes the original or creates a new object, because we intend to overwrite the original anyway.

However, if you plan to write the result to a new object:

df1 = add_column(df)
# |              |
# New Obj        Function still modifies this though!
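The pitfall can be checked directly with an identity test. In the sketch below (the sample column "a" is an assumption for illustration), df1 is not a new DataFrame at all; it is a second name for the same object.

```python
import pandas as pd

def add_column(df):
    df["new_column"] = 7
    return df

df = pd.DataFrame({"a": [1, 2, 3]})
df1 = add_column(df)

print(df1 is df)                   # True: both names refer to one object
print("new_column" in df.columns)  # True: the original was modified too
```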

A safe alternative, which requires no knowledge of the underlying source code, is to have your function copy the DataFrame at the top. Changes to df within that scope then do not affect the original df outside the function.

def add_column_maintain_original(df):
    df = df.copy()

    df['new_column'] = 7
    return df
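As a quick check of this pattern, the sketch below calls the function on a small sample frame (the column "a" is an assumption for illustration) and confirms the original is untouched. Note that DataFrame.copy() makes a deep copy by default, so this safety costs extra memory and time on large frames.

```python
import pandas as pd

def add_column_maintain_original(df):
    # Work on a private copy so the caller's DataFrame is never touched.
    df = df.copy()
    df["new_column"] = 7
    return df

original = pd.DataFrame({"a": [1, 2, 3]})
result = add_column_maintain_original(original)

print(list(original.columns))  # ['a']
print(list(result.columns))    # ['a', 'new_column']
```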

Another possibility is to pass a copy to the function:

df1 = add_column(df.copy())
