Programmatically slice a Pandas DataFrame in place
Question
I have a bunch of DataFrames that I am trying to slice and assign back to their original names, but I am running into a namespace issue. Below is what I have.
import pandas as pd
import numpy as np

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts = dfts[idx:]
        print(dfts.head())

truncate_before(mylist, 11)
print(df_a)
The print statements inside the truncate_before function show 3 rows, corresponding to the rows labelled 11, 12 and 13. But the final print statement shows rows 0 to 13.
So outside the function, the DataFrames are back to their original state. I was under the impression that Python passes arguments by reference. What am I missing?
Recommended answer
In truncate_before:
def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts = dfts[idx:]
        print(dfts.head())
the for-loop creates a local variable dfts which references, in turn, each DataFrame in list_of_dfts. But the statement
dfts = dfts[idx:]
only rebinds the local name dfts to a new value. It does not change the DataFrame in list_of_dfts.
See "Facts and myths about Python names and values" for a great explanation of how variable names bind to values, and which operations change values versus binding new values to variable names.
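The difference between rebinding a name and changing a value can be seen without pandas at all. The following is a minimal sketch using plain lists; the function names rebind and mutate are made up for this illustration.

```python
def rebind(items):
    # Assigning to the loop variable only points the local name
    # `item` at a new object; the caller's object is untouched.
    for item in items:
        item = item[1:]

def mutate(items):
    # Slice assignment changes the object that `item` refers to,
    # so the caller sees the change.
    for item in items:
        item[:] = item[1:]

a = [1, 2, 3]
b = [1, 2, 3]
rebind([a])
mutate([b])

print(a)  # [1, 2, 3]
print(b)  # [2, 3]
```

The loop in the question behaves like rebind: the slice is computed, bound to the local name, and then discarded.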
There are a number of options here:
Modify the list in-place
def truncate_before(list_of_dfts, idx):
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]
    for dfts in list_of_dfts:
        print(dfts.head())
This works because assigning to list_of_dfts[:] (which calls list_of_dfts.__setitem__) changes the contents of list_of_dfts in-place. For example, running
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]

print(mylist[0])
truncate_before(mylist, 11)
print(mylist[0])
shows that mylist[0] has been truncated. Note, however, that df_a still references the original DataFrame.
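One way to confirm this is to check object identity after the slice assignment. This is a small sketch; the name original_list is introduced here only to hold a second reference to the same list.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
mylist = [df_a]
original_list = mylist  # second name for the same list object

# Slice assignment replaces the list's contents, not the list itself.
mylist[:] = [df[11:] for df in mylist]

print(mylist is original_list)    # True: still the same list object
print(mylist[0] is df_a)          # False: the list now holds a new DataFrame
print(len(mylist[0]), len(df_a))  # 3 14
```

The list was changed in place, but the name df_a was never rebound, so it still points at the 14-row DataFrame.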
Return the list and reassign mylist or df_a, df_b to the result
Using return values may make it unnecessary to modify mylist in-place.
To reassign the global variables df_a, df_b to new values, you could make truncate_before return the list of DataFrames, and reassign df_a and df_b to the returned values:
def truncate_before(list_of_dfts, idx):
    return [dfts[idx:] for dfts in list_of_dfts]

mylist = truncate_before(mylist, 11) # or
# df_a, df_b = truncate_before(mylist, 11) # or
# mylist = df_a, df_b = truncate_before(mylist, 11)
But note that it is probably not a good idea to access the DataFrames through both mylist and the names df_a and df_b since, as the example above shows, the values do not stay coordinated automagically. Using mylist alone should suffice.
Use a DataFrame method with the inplace parameter, such as df.drop
dfts.drop (with inplace=True) modifies dfts itself:
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts.drop(dfts.index[:idx], inplace=True)

truncate_before(mylist, 11)
print(mylist[0])
print(df_a)
By modifying dfts in-place, the values seen through mylist and through df_a and df_b change at the same time.
Note that dfts.drop drops rows based on index label values. So the above relies on the assumption that dfts.index is unique. If dfts.index is not unique, dfts.drop may drop more than idx rows. For example,
df = pd.DataFrame([1,2], index=['A', 'A'])
df.drop(['A'], inplace=True)
drops both rows, making df an empty DataFrame.
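The gap between label-based and position-based removal can be seen on a small frame with a repeated label. This sketch uses iloc for the positional version, which is not part of the answer above; it returns a new object rather than modifying df in place, and the sample data is made up.

```python
import pandas as pd

# The label 'A' appears twice in the index.
df = pd.DataFrame({'X': [1, 2, 3, 4]}, index=['A', 'B', 'A', 'C'])
idx = 1

by_label = df.drop(df.index[:idx])  # removes every row labelled 'A'
by_position = df.iloc[idx:]         # removes exactly the first idx rows

print(len(by_label))     # 2
print(len(by_position))  # 3
```

With a unique index the two results have the same rows; with duplicates only the positional slice is guaranteed to remove exactly idx rows.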
Note also this warning from a Pandas core developer regarding the use of inplace:
My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.
This is probably because under the hood, dfts.drop creates a new DataFrame and then calls the _update_inplace private method to assign the new data to the old DataFrame:
def _update_inplace(self, result, verify_is_copy=True):
    """
    replace self internals with result.
    ...
    """
    self._reset_cache()
    self._clear_item_cache()
    self._data = getattr(result,'_data',result)
    self._maybe_update_cacher(verify_is_copy=verify_is_copy)
Since the temporary result had to be created anyway, there is no memory or performance benefit of "in-place" operations over simple reassignment.
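As a rough check that the two styles give the same data, the sketch below drops the same rows once with inplace=True and once by reassignment. The variable names and sample data are made up for illustration, and it compares results only, not memory use or speed.

```python
import numpy as np
import pandas as pd

df_inplace = pd.DataFrame(np.arange(10).reshape(5, 2), columns=list('XY'))
df_reassigned = df_inplace.copy()

# In-place style: the existing object is updated.
df_inplace.drop(df_inplace.index[:3], inplace=True)

# Reassignment style: the name is rebound to the new object.
df_reassigned = df_reassigned.drop(df_reassigned.index[:3])

print(df_inplace.equals(df_reassigned))
```

The practical difference is only in which names see the change: in-place updates are visible through every reference to the object, while reassignment affects just the name being rebound.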