Programmatically slice a Pandas dataframe in place

Problem description

I have a bunch of dataframes that I am trying to slice and assign back to the original names. But I am finding that there is a namespace issue. Below is what I have.

import pandas as pd
import numpy as np

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))

mylist =[df_a, df_b]

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts = dfts[idx:]
        print(dfts.head())

truncate_before(mylist, 11)
print(df_a)

In the print statements within the truncate_before function, it shows 3 rows, corresponding to the 11th, 12th and 13th row. But the final print statement shows 0th to 13th rows.

So outside the function, it reverts back to the original dataframes. I was under the impression that Python passes arguments by reference. What am I missing?

Recommended answer

In truncate_before:

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts = dfts[idx:]
        print(dfts.head())

the for-loop creates a local variable dfts which references the DataFrames in list_of_dfts. But

        dfts = dfts[idx:]

reassigns a new value to dfts. It does not change the DataFrame in list_of_dfts.
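
The difference between rebinding a name and changing an object can be seen without pandas. The following is a minimal sketch that uses plain lists as stand-ins for DataFrames; the function names rebind and mutate are hypothetical and chosen only for illustration.

```python
# Plain lists stand in for DataFrames; the name-binding rules are the same.
items = [[1, 2, 3], [4, 5, 6]]

def rebind(list_of_items):
    for item in list_of_items:
        # Binds the local name `item` to a new list.
        # The element stored in list_of_items is untouched.
        item = item[1:]

def mutate(list_of_items):
    for item in list_of_items:
        # Changes the object that the list element refers to.
        del item[:1]

rebind(items)
after_rebind = [list(item) for item in items]   # copy of the current state

mutate(items)
after_mutate = [list(item) for item in items]   # copy of the current state

print(after_rebind)   # [[1, 2, 3], [4, 5, 6]]
print(after_mutate)   # [[2, 3], [5, 6]]
```

Only the second function has an effect that is visible to the caller, because it operates on the objects themselves rather than on a local name.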

See Facts and myths about Python names and values for a great explanation of how variable names bind to values, and what operations change values versus binding new values to variable names.

There are several options:

Modify the list in place

def truncate_before(list_of_dfts, idx):
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]
    for dfts in list_of_dfts:
        print(dfts.head())

since assigning to list_of_dfts[:] (which calls list_of_dfts.__setitem__) changes the contents of list_of_dfts in-place.

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))

mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]

print(mylist[0])
truncate_before(mylist, 11)
print(mylist[0])

shows mylist[0] has been truncated. Note that df_a still references the original DataFrame, however.
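
To make the last point concrete, here is a small sketch (same setup as above) that checks the row counts and object identities after the call. The variable names list_id_before and same_list are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
df_b = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))

mylist = [df_a, df_b]
list_id_before = id(mylist)

def truncate_before(list_of_dfts, idx):
    # Slice assignment replaces the contents of the existing list object.
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]

truncate_before(mylist, 11)

same_list = id(mylist) == list_id_before
print(same_list)          # True: still the same list object
print(len(mylist[0]))     # 3: the list now holds the truncated frame
print(len(df_a))          # 14: df_a still names the original frame
print(mylist[0] is df_a)  # False: the two names no longer refer to the same object
```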

Return the list and reassign mylist or df_a, df_b to the result

Using return values may make it unnecessary to modify mylist in-place.

To reassign the global variables df_a and df_b to new values, you could make truncate_before return the list of DataFrames, and reassign df_a and df_b to the returned value:

def truncate_before(list_of_dfts, idx):
    return [dfts[idx:] for dfts in list_of_dfts]

mylist = truncate_before(mylist, 11)   # or
# df_a, df_b = truncate_before(mylist, 11) # or
# mylist = df_a, df_b = truncate_before(mylist, 11)  

But note that it is probably not good to access the DataFrames through both mylist and df_a and df_b, since as the example above shows, the values do not stay coordinated automagically. Using mylist should suffice.

Use a DataFrame method with the inplace parameter, such as df.drop

dfts.drop (with inplace=True) modifies dfts itself:

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))

mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts.drop(dfts.index[:idx], inplace=True)

truncate_before(mylist, 11)
print(mylist[0])
print(df_a)

By modifying dfts inplace, both the values in mylist and df_a and df_b get changed at the same time.

Note that dfts.drop drops rows based on index label value. So the above relies on (assumes) dfts.index being unique. If dfts.index is not unique, dfts.drop may drop more than idx rows. For example,

df = pd.DataFrame([1,2], index=['A', 'A'])
df.drop(['A'], inplace=True)

drops both rows making df an empty DataFrame.
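
A slightly larger sketch of the same pitfall follows. It contrasts label-based dropping with positional slicing on a frame whose index has a repeated label; the frame and the names by_label and by_position are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3], index=['A', 'A', 'B'])

# The assumption that drop relies on can be checked up front.
print(df.index.is_unique)    # False

# Label-based: asking to drop the first label removes every row labelled 'A'.
by_label = df.drop(df.index[:1])

# Position-based: keeps everything from position 1 onward, whatever the labels are.
by_position = df.iloc[1:]

print(len(by_label))         # 1
print(len(by_position))      # 2
```

Checking index.is_unique before relying on drop is one way to guard against this; positional slicing avoids the issue but returns a new object instead of modifying the original.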

Note also this warning from Pandas' core developer regarding the use of inplace:


My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.

This is probably because under the hood, dfts.drop creates a new dataframe and then calls the _update_inplace private method to assign the new data to the old DataFrame:

def _update_inplace(self, result, verify_is_copy=True):
    """
    replace self internals with result.
    ...
    """
    self._reset_cache()
    self._clear_item_cache()
    self._data = getattr(result,'_data',result)
    self._maybe_update_cacher(verify_is_copy=verify_is_copy)

Since the temporary result had to be created anyway, there is no memory or performance benefit of "in-place" operations over simple reassignment.
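
What does differ between the two approaches is which names see the change. The sketch below illustrates only that observable difference and makes no measurement of memory or speed; the names alias and alias2 are hypothetical.

```python
import numpy as np
import pandas as pd

# In-place drop: every name bound to the object sees the truncated frame.
df = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
alias = df
df.drop(df.index[:11], inplace=True)
print(alias is df, len(alias))                # True 3

# Reassignment: the name is rebound to a new object; other names keep the old one.
df2 = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
alias2 = df2
df2 = df2[11:]
print(alias2 is df2, len(alias2), len(df2))   # False 14 3
```

So the choice between inplace=True and reassignment is mainly about whether other references to the same DataFrame should observe the change.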
