How do numpy functions operate on pandas objects internally?
Question
NumPy functions such as np.mean() and np.var() accept an array-like argument, for example an np.array or a list.
But passing in a pandas DataFrame also works. This means that a DataFrame can pass itself off as a NumPy array, which I find a little strange (even though I know that the underlying values of a DataFrame are NumPy arrays).
For an object to be array-like, I thought it should be sliceable with integer indexing in the same way a NumPy array is. So, for instance, df[1:3, 2:3] should work, but it raises an error.
So possibly a DataFrame gets converted into a NumPy array once it is inside the function. But if that is the case, why does np.mean(numpy_array) give a different result from np.mean(df)?
a = np.random.rand(4,2)
a
Out[13]:
array([[ 0.86688862, 0.09682919],
[ 0.49629578, 0.78263523],
[ 0.83552411, 0.71907931],
[ 0.95039642, 0.71795655]])
np.mean(a)
Out[14]: 0.68320065182041034
This gives a different result from the following:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
columns=range(np.shape(a)[1]))
df
Out[18]:
0 1
0 0.866889 0.096829
1 0.496296 0.782635
2 0.835524 0.719079
3 0.950396 0.717957
np.mean(df)
Out[21]:
0 0.787276
1 0.579125
dtype: float64
The former output is a single number, whereas the latter is a column-wise mean. How does a NumPy function know about the structure of a DataFrame?
Recommended answer
If you step through the np.mean(df) call in a debugger:
--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean
You can see that the type of a is not ndarray, so np.mean tries to call a.mean, which in this case is df.mean():
In [6]:
df.mean()
Out[6]:
0 0.572999
1 0.468268
dtype: float64
This is why the outputs differ: np.mean(df) returns whatever df.mean() returns, which here is the column-wise mean.
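The delegation seen in the debugger trace can be reproduced without pandas. The sketch below uses a hypothetical toy class, HasMean (not part of NumPy or pandas), whose only job is to show that np.mean hands the work to the object's own mean method when the argument is not an ndarray. It assumes a NumPy version in which np.mean still follows the code path shown in the trace.

```python
import numpy as np


class HasMean:
    """Hypothetical toy class: not an ndarray, but it defines .mean()."""

    def __init__(self, data):
        self.data = list(data)
        self.called_with = None

    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        # np.mean forwards axis, dtype and out as keyword arguments.
        self.called_with = {"axis": axis, "dtype": dtype, "out": out}
        # Return a marker instead of a number so the delegation is obvious.
        return "HasMean.mean was called"


obj = HasMean([1.0, 2.0, 3.0])
result = np.mean(obj)

print(result)           # the marker string, not 2.0
print(obj.called_with)  # the keyword arguments np.mean passed along
```

Because np.mean returns whatever the object's mean method returns, the result for a DataFrame is decided by pandas, not by NumPy.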
The code that reproduces the above:
In [3]:
a = np.random.rand(4,2)
a
Out[3]:
array([[ 0.96750329, 0.67623187],
[ 0.44025179, 0.97312747],
[ 0.07330062, 0.18341157],
[ 0.81094166, 0.04030253]])
In [4]:
np.mean(a)
Out[4]:
0.52063384885403818
In [5]:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
columns=range(np.shape(a)[1]))
df
Out[5]:
0 1
0 0.967503 0.676232
1 0.440252 0.973127
2 0.073301 0.183412
3 0.810942 0.040303
NumPy output:
In [7]:
np.mean(df)
Out[7]:
0 0.572999
1 0.468268
dtype: float64
If you call .values to get the underlying NumPy array first, the output is the same as np.mean(a):
In [8]:
np.mean(df.values)
Out[8]:
0.52063384885403818
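As a rough sketch of how to avoid depending on the delegation at all, the example below converts the DataFrame explicitly and passes axis explicitly. It reuses the array values from the session above. np.asarray(df) is used for the conversion, and .iloc is shown as the positional counterpart of the NumPy-style slice from the question. As far as I know, the result of np.mean(df) itself depends on the pandas version (newer releases treat axis=None as a reduction over both axes), so the sketch does not rely on it.

```python
import numpy as np
import pandas as pd

# Same values as the array in the session above.
a = np.array([[0.96750329, 0.67623187],
              [0.44025179, 0.97312747],
              [0.07330062, 0.18341157],
              [0.81094166, 0.04030253]])
df = pd.DataFrame(a)

# Explicit conversion: "array-like" means convertible to an ndarray,
# not that the object supports NumPy-style indexing.
arr = np.asarray(df)

overall = np.mean(arr)             # one number over all elements
by_column = np.mean(arr, axis=0)   # one number per column
pandas_by_column = df.mean(axis=0) # the same column-wise means via pandas

# Positional slicing on a DataFrame goes through .iloc, not df[...].
block = df.iloc[1:3, 0:1]

print(overall)
print(by_column)
print(block.shape)
```

Stating the axis in both libraries makes the intent visible and keeps the result independent of which library's default ends up being applied.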