How do numpy functions operate on pandas objects internally?

Question

Numpy functions such as np.mean() and np.var() accept an array-like argument, for example an np.array or a list.

But passing in a pandas DataFrame also works. This means that a DataFrame can pass itself off as a numpy array, which I find a little strange, even though I know that the underlying values of a DataFrame are numpy arrays.

For an object to be array-like, I thought it should be sliceable with integer indexing in the same way a numpy array is. So, for instance, df[1:3, 2:3] should work, but it raises an error.
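
For reference, here is a minimal sketch of how positional slicing and array conversion look with pandas. It assumes a small 4x3 frame made up for illustration (the frame in this question has only two columns, so the column slice 2:3 would select nothing). `DataFrame.iloc` is pandas' positional indexer and `np.asarray` is numpy's standard conversion function; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 data, chosen so that the column slice 2:3 is non-empty.
arr = np.arange(12, dtype=float).reshape(4, 3)
df = pd.DataFrame(arr)

# Plain df[...] does not accept numpy-style 2-D slicing.
try:
    df[1:3, 2:3]
    plain_indexing_failed = False
except Exception:  # the exact exception type varies across versions
    plain_indexing_failed = True

# Positional slicing goes through .iloc instead.
by_position = df.iloc[1:3, 2:3]

# A DataFrame can still be converted to an ndarray on request,
# after which numpy-style slicing works as usual.
as_array = np.asarray(df)
sliced_array = as_array[1:3, 2:3]

print(plain_indexing_failed)
print(by_position.shape, sliced_array.shape)
```

So being accepted by a numpy function does not require supporting numpy's indexing syntax; being convertible to an array (or, as the answer below shows, providing a method of the same name) can be enough.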

So perhaps the DataFrame gets converted into a numpy array inside the function. But if that is the case, why does np.mean(numpy_array) give a different result from np.mean(df)?

a = np.random.rand(4,2)
a
Out[13]: 
array([[ 0.86688862,  0.09682919],
       [ 0.49629578,  0.78263523],
       [ 0.83552411,  0.71907931],
       [ 0.95039642,  0.71795655]])

np.mean(a)
Out[14]: 0.68320065182041034

This gives a different result from the following:

df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))

df
Out[18]: 
          0         1
0  0.866889  0.096829
1  0.496296  0.782635
2  0.835524  0.719079
3  0.950396  0.717957

np.mean(df)
Out[21]: 
0    0.787276
1    0.579125
dtype: float64

The former output is a single number, whereas the latter is a column-wise mean. How does a numpy function know about the structure of a DataFrame?

Recommended answer

If you step through np.mean(df) in the debugger:

--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean

You can see that the type is not ndarray, so np.mean tries to call a.mean, which in this case is df.mean():

In [6]:

df.mean()
Out[6]:
0    0.572999
1    0.468268
dtype: float64

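This is why the output is different: df.mean() computes column-wise means by default, whereas np.mean on a plain ndarray averages over the flattened array.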
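
The same delegation can be observed without pandas. The sketch below uses a made-up class, `FakeFrame`, that is not an ndarray but has its own `mean` method. Assuming the installed numpy still follows the code path shown in the trace above, `np.mean` hands the call to that method and returns whatever it returns. The class and its column-wise default are illustrative assumptions, not pandas' actual implementation.

```python
import numpy as np


class FakeFrame:
    """Hypothetical stand-in for a DataFrame: not an ndarray, but has .mean."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)

    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        # Mimic a column-wise default: treat axis=None as axis=0.
        # This is an illustrative choice, not pandas' actual code.
        if axis is None:
            axis = 0
        return ("FakeFrame.mean was called", self.data.mean(axis=axis))


values = [[1.0, 2.0], [3.0, 4.0]]

# On a plain ndarray, np.mean averages every element.
overall = np.mean(np.array(values))

# On FakeFrame, np.mean delegates to FakeFrame.mean.
tag, per_column = np.mean(FakeFrame(values))

print(overall)
print(tag, per_column)
```

The point of the sketch is that numpy does not need to know anything about the object's layout; it only needs the object to expose a `mean` method.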

Code to reproduce the above:

In [3]:
a = np.random.rand(4,2)
a

Out[3]:
array([[ 0.96750329,  0.67623187],
       [ 0.44025179,  0.97312747],
       [ 0.07330062,  0.18341157],
       [ 0.81094166,  0.04030253]])

In [4]:    
np.mean(a)

Out[4]:
0.52063384885403818

In [5]:    
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
                  columns=range(np.shape(a)[1]))

df

Out[5]:
          0         1
0  0.967503  0.676232
1  0.440252  0.973127
2  0.073301  0.183412
3  0.810942  0.040303

numpy output:

In [7]:
np.mean(df)

Out[7]:
0    0.572999
1    0.468268
dtype: float64

If you call .values to get the underlying numpy array, the output matches np.mean(a):

In [8]:
np.mean(df.values)

Out[8]:
0.52063384885403818
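
To make the comparison independent of which code path np.mean takes, the axis can be stated explicitly on both sides. The sketch below reuses the array printed in the transcript above. It uses `DataFrame.to_numpy()`, which for this all-float frame should give the same array as `.values`; the variable names are illustrative, and the result of a bare `np.mean(df)` is deliberately not relied on, since it depends on what the installed pandas does with the arguments numpy passes along.

```python
import numpy as np
import pandas as pd

# The array from the transcript above.
a = np.array([
    [0.96750329, 0.67623187],
    [0.44025179, 0.97312747],
    [0.07330062, 0.18341157],
    [0.81094166, 0.04030253],
])
df = pd.DataFrame(a)

# Convert explicitly, then average over every element.
overall_from_df = np.mean(df.to_numpy())

# Ask for column-wise means explicitly on both sides.
columns_from_df = df.mean(axis=0)
columns_from_array = np.mean(a, axis=0)

print(overall_from_df, np.mean(a))
print(columns_from_df.to_numpy(), columns_from_array)
```

With the axis spelled out, the DataFrame and the ndarray agree in both cases.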
