Will changes in DataFrame.values always modify the values in the data frame?

Question

The documentation says:

Numpy representation of NDFrame

What does "Numpy representation of NDFrame" mean? Will modifying this numpy representation affect my original dataframe? In other words, will .values return a copy or a view?

Some answers on StackOverflow implicitly rely on a view being returned. For example, the accepted answer to "Set values on the diagonal of pandas.DataFrame" uses np.fill_diagonal(df.values, 0) to set all values on the diagonal of df to 0, which only works if a view is returned. However, as shown in @coldspeed's answer, sometimes a copy is returned.

This feels very basic, but it is a bit odd to me that I cannot find a more detailed reference for .values.

Another experiment that returns a view in addition to the current experiments in @coldspeed's answer:

import pandas as pd

df = pd.DataFrame([["A", "B"], ["C", "D"]])
df.values[0][0] = 0

We get

df
    0   1
0   0   B
1   C   D

Even though it is of mixed type now, we can still modify the original df by setting df.values:

df.values[0][1] = 5
df
    0   1
0   0   5
1   C   D

Recommended answer

TL;DR:

Whether values returns a copy (in which case changing the values would not change the DataFrame) or a view (in which case changing the values would change the DataFrame) is an implementation detail. Don't rely on either case. The behavior could change if the pandas developers think it would be beneficial (for example, if they changed the internal structure of DataFrame).
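One way to avoid depending on the view/copy behavior is to request an explicit copy, modify it, and build a new DataFrame from the result. The following is only a minimal sketch of that idea, applied to the diagonal example from the question; the frame contents and the names arr and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative 3x3 frame; the values and labels are assumptions for this sketch.
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "c"])

# Ask for an explicit copy, so the array never aliases df's internal storage.
arr = df.to_numpy(copy=True)
np.fill_diagonal(arr, 0)

# Build a new DataFrame from the modified array, reusing the original labels.
result = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(result)
```

Because arr is guaranteed to be a copy, df itself is left untouched and the outcome does not depend on what values happens to return.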

I guess the documentation has changed since the question was asked. It currently reads:

pandas.DataFrame.values

Return a Numpy representation of the DataFrame.

Only the values in the DataFrame will be returned, the axes labels will be removed.

It no longer mentions NDFrame; it simply says "NumPy representation of the DataFrame". A NumPy representation could be either a view or a copy!

The documentation also contains a Note about mixed dtypes:

Notes

The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.

e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64 will result in a float64 dtype.
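The upcasting rules quoted above can be checked with a small experiment. This is only a sketch: the helper values_dtype and the column names are made up for illustration, and the result reflects whatever pandas version is installed.

```python
import numpy as np
import pandas as pd


def values_dtype(left, right):
    """Return the dtype of .values for a two-column frame with the given column dtypes."""
    df = pd.DataFrame({
        "x": np.array([1, 2], dtype=left),
        "y": np.array([3, 4], dtype=right),
    })
    return df.values.dtype


print(values_dtype(np.float16, np.float32))  # float32
print(values_dtype(np.int32, np.uint8))      # int32
print(values_dtype(np.int64, np.uint64))     # float64
```

In each case the columns have to be converted into one common dtype, so the returned array cannot simply alias the original column data.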

From these notes it's clear that accessing the values of a DataFrame that contains different dtypes can (almost) never return a view, simply because the values have to be placed into an array of the "lowest-common-denominator" dtype, and that involves a copy.

However, it doesn't say anything about the view/copy behavior, and that's by design. jreback mentioned on the pandas issue tracker [1] that this really is just an implementation detail:

this is an implementation detail. since you are getting a single dtyped numpy array, it is upcast to a compatible dtype. if you have mixed dtypes, then you almost always will have a copy (the exception is mixed float dtypes will not copy I think), but this is a numpy detail.

I agree this is not great, but it has been there from the beginning and will not change in current pandas. If exporting to numpy you need to take care.

Even the documentation of Series mentions nothing about a view:

pandas.Series.values

Return Series as ndarray or ndarray-like depending on the dtype

It even says that, depending on the dtype, the result might not be a plain array at all. That certainly includes the possibility (even if only hypothetical) that it returns a copy. It does not guarantee that you get a view.
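As a small illustration of the "ndarray or ndarray-like" wording, a categorical Series is one case where values is not a plain NumPy array. The Series contents below are made up for this sketch.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([1, 2, 3])
categorical = pd.Series(["a", "b", "a"], dtype="category")

# A numeric Series hands back a NumPy array ...
print(type(numeric.values))
# ... while a categorical Series hands back a pandas Categorical.
print(type(categorical.values))
```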

The answer is simply: it's an implementation detail, and as long as it's an implementation detail there won't be any guarantees. It is an implementation detail because the pandas developers want to be able to change the internal storage if they want to. However, in some cases it's impossible to create a view, for example with a DataFrame containing columns of different dtypes.

There might be advantages if you analyze the behavior to date. But as long as that's an implementation detail you shouldn't really rely on it anyways.

However, if you're interested: pandas currently stores columns with the same dtype internally as a multi-dimensional array. That has the advantage that you can operate on rows and columns very efficiently (at least as long as they have the same dtype). But if the DataFrame contains mixed types, it will have several internal multi-dimensional arrays, one for each dtype. It's not possible to create a view that points into two distinct arrays (at least with NumPy), so when you have mixed dtypes you'll get a copy when you ask for the values.
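One way to observe this without writing to the array is np.shares_memory. The sketch below checks whether two successive values calls are backed by the same memory; the helper values_alias and both frames are made up for illustration. The result for the single-dtype frame is exactly the implementation detail discussed here and may differ between pandas versions, so no output is claimed for it.

```python
import numpy as np
import pandas as pd

# One float64 block versus an int column plus a float column.
same_dtype = pd.DataFrame(np.zeros((2, 2)), columns=["a", "b"])
mixed_dtype = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})


def values_alias(df):
    """Return True if two successive .values calls are backed by the same memory."""
    return np.shares_memory(df.values, df.values)


print(values_alias(same_dtype))   # implementation detail, may vary by version
print(values_alias(mixed_dtype))  # False: each call builds a fresh array
```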

As a side note, your example:

df = pd.DataFrame([["A", "B"],["C", "D"]])

df.values[0][0] = 0

isn't mixed-dtype. It has a specific dtype: object. However, object arrays can contain any Python object, so I can see why you would say or assume that it's of mixed types.
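The distinction can be made visible with a short sketch. Note one assumption: dtype=object is passed explicitly here, because newer pandas versions may infer a dedicated string dtype for such data instead of object. The array mixed_objects is made up for illustration.

```python
import numpy as np
import pandas as pd

# dtype=object is set explicitly (an assumption of this sketch, see above).
df = pd.DataFrame([["A", "B"], ["C", "D"]], dtype=object)
print(df.dtypes)

# A single object array can hold arbitrary Python objects side by side.
mixed_objects = np.array(["A", 0, 5.5], dtype=object)
print([type(item).__name__ for item in mixed_objects])
```

Every column has the same dtype (object), so the frame is not mixed-dtype in the pandas sense, even though the individual elements can be of different Python types.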

Personal note:

Personally, I would have preferred that the values property only ever returns a view, or raises an error when it cannot, together with an additional method (e.g. as_array) that only ever returns a copy, even when a view would be possible. That would make the behavior more predictable and avoid surprises: a property that performs an expensive copy is certainly unexpected.

[1] This question has been mentioned in the issue thread, so the docs may have changed because of it.
