What does the term "broadcasting" mean in Pandas documentation?

Question

I'm reading through the Pandas documentation, and the term "broadcasting" is used extensively, but never really defined or explained.

What does it mean?

Recommended answer

The term broadcasting comes from numpy. Simply put, it describes the rules for the output that results when you perform operations between n-dimensional arrays (these could be panels, dataframes, or series) or scalar values.

So the simplest case is just multiplying by a scalar value:

In [4]:
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))
s

Out[4]:
0    0
1    1
2    2
3    3
4    4
dtype: int32

In [5]:    
s * 10

Out[5]:
0     0
1    10
2    20
3    30
4    40
dtype: int32

We get the same expected results with a dataframe:

In [6]:    
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4)})
df

Out[6]:
          a         b
0  0.216920  0.652193
1  0.968969  0.033369
2  0.637784  0.856836
3 -2.303556  0.426238

In [7]:    
df * 10

Out[7]:
           a         b
0   2.169204  6.521925
1   9.689690  0.333695
2   6.377839  8.568362
3 -23.035557  4.262381

So what is technically happening here is that the scalar value has been broadcast along the same dimensions as the Series and DataFrame above.

Say we have a 2-D dataframe of shape 4 x 3 (4 rows x 3 columns). We can perform an operation along the x-axis by using a 1-D Series whose length matches the row length (here, 3):

In [8]:
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4), 'c':np.random.randn(4)})
df

Out[8]:
          a         b         c
0  0.122073 -1.178127 -1.531254
1  0.011346 -0.747583 -1.967079
2 -0.019716 -0.235676  1.419547
3  0.215847  1.112350  0.659432

In [26]:    
df.iloc[0]

Out[26]:
a    0.122073
b   -1.178127
c   -1.531254
Name: 0, dtype: float64

In [27]:    
df + df.iloc[0]

Out[27]:
          a         b         c
0  0.244146 -2.356254 -3.062507
1  0.133419 -1.925710 -3.498333
2  0.102357 -1.413803 -0.111707
3  0.337920 -0.065777 -0.871822

The above looks funny at first until you understand what is happening: I took the first row of values and added it row-wise to the df. It can be visualised using this picture (sourced from scipy):

[Image: broadcasting diagram from the scipy documentation]

The general rule is this:

In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
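
As a rough illustration of this rule, the sketch below asks numpy which shapes are compatible. It assumes NumPy 1.20 or later, where np.broadcast_shapes is available; the shapes and variable names are made up for the example.

```python
import numpy as np

# Shapes are compared starting from the trailing (rightmost) axis.
same_trailing = np.broadcast_shapes((4, 3), (3,))   # trailing sizes match
size_one = np.broadcast_shapes((4, 1), (3,))        # the size-1 axis is stretched

print(same_trailing)  # (4, 3)
print(size_one)       # (4, 3)

# Trailing sizes 3 and 4 differ and neither is 1, so numpy refuses.
try:
    np.broadcast_shapes((4, 3), (4,))
    incompatible = False
except ValueError as err:
    incompatible = True
    print("cannot broadcast:", err)
```

No arrays are allocated here; only the shape arithmetic is checked, which makes it a cheap way to test whether two operands will combine.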

So if I tried to add a 1-D array that didn't match in length, say one with 4 elements, numpy would raise a ValueError, whereas in Pandas you get a df full of NaN values:

In [30]:
df + pd.Series(np.arange(4))

Out[30]:
    a   b   c   0   1   2   3
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
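
The NaN values come from label alignment: the Series is indexed 0..3, and none of those labels match the column names. A minimal sketch of two ways to make the labels line up follows; the data and variable names are invented for illustration, and DataFrame.add with axis=0 is used to align on the row index instead of the columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["a", "b", "c"])

# A Series labelled with the column names aligns column by column.
by_column = df + pd.Series([10.0, 20.0, 30.0], index=df.columns)

# A length-4 Series indexed 0..3 matches the row labels instead;
# axis=0 tells pandas to align on the row index.
by_row = df.add(pd.Series(np.arange(4.0)), axis=0)

# With no matching labels at all, every cell is NaN.
unaligned = df + pd.Series(np.arange(4.0))

print(by_column)
print(by_row)
```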

Now one of the great things about pandas is that it will try to align using existing column names and row labels. This can get in the way when trying to perform fancier broadcasting like this:

In [55]:
df[['a']] + df.iloc[0]

Out[55]:
          a   b   c
0  0.244146 NaN NaN
1  0.133419 NaN NaN
2  0.102357 NaN NaN
3  0.337920 NaN NaN

In the above I use double subscripting to force the shape to be (4,1), but we see a problem when trying to broadcast using the first row: column alignment only aligns on the first column. To get the same form of broadcasting as the diagram above shows, we have to decompose to numpy arrays, which then become anonymous data:

In [56]:
df[['a']].values + df.iloc[0].values

Out[56]:
array([[ 0.24414608, -1.05605392, -1.4091805 ],
       [ 0.13341899, -1.166781  , -1.51990758],
       [ 0.10235701, -1.19784299, -1.55096957],
       [ 0.33792013, -0.96227987, -1.31540645]])

It's also possible to broadcast in 3 dimensions. I don't go near that stuff often, but the numpy, scipy and pandas books have examples that show how it works.
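
For what it's worth, here is a small numpy sketch of the 3-D case. It is not taken from any of those books; the shapes and names are chosen only to show the trailing-axes rule at work with a third dimension.

```python
import numpy as np

cube = np.zeros((2, 4, 3))           # 2 blocks, each 4 rows x 3 columns
row = np.array([1.0, 2.0, 3.0])      # shape (3,)
col = np.arange(4.0).reshape(4, 1)   # shape (4, 1)

# (2, 4, 3) with (3,): the row is added to every row of every block.
plus_row = cube + row

# (2, 4, 3) with (4, 1): the column is stretched across the 3 columns,
# then repeated for both blocks.
plus_col = cube + col

print(plus_row.shape, plus_col.shape)
```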

Generally speaking, the thing to remember is that, aside from scalar values, which are simple, for n-D arrays the minor/trailing axis lengths must match or one of them must be 1.

Update

It seems that the above now leads to ValueError: Unable to coerce to Series, length must be 1: given 3 in the latest version of pandas (0.20.2), so you have to call .values on the df first:

In[42]:
df[['a']].values + df.iloc[0].values

Out[42]: 
array([[ 0.244146, -1.056054, -1.409181],
       [ 0.133419, -1.166781, -1.519908],
       [ 0.102357, -1.197843, -1.55097 ],
       [ 0.33792 , -0.96228 , -1.315407]])

To restore this back to the original df we can construct a df from the np array and pass the original columns in the args to the constructor:

In[43]:
pd.DataFrame(df[['a']].values + df.iloc[0].values, columns=df.columns)

Out[43]: 
          a         b         c
0  0.244146 -1.056054 -1.409181
1  0.133419 -1.166781 -1.519908
2  0.102357 -1.197843 -1.550970
3  0.337920 -0.962280 -1.315407
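
If the dataframe has a non-default row index, the index probably needs to be passed back as well, because the numpy array carries no labels. A sketch under that assumption, with made-up data and row labels, and using to_numpy() (available in newer pandas versions; .values behaves the same way here):

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0],
     "b": [10.0, 20.0, 30.0, 40.0],
     "c": [100.0, 200.0, 300.0, 400.0]},
    index=["w", "x", "y", "z"],
)

# (4, 1) column plus (3,) row broadcasts to a plain (4, 3) ndarray.
values = df[["a"]].to_numpy() + df.iloc[0].to_numpy()

# Re-attach both the row and the column labels.
restored = pd.DataFrame(values, index=df.index, columns=df.columns)
print(restored)
```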
