What does the term "broadcasting" mean in Pandas documentation?


Question

I'm reading through the Pandas documentation, and the term "broadcasting" is used extensively, but never really defined or explained.

What does it mean?

Solution

The term broadcasting comes from numpy. Put simply, it describes the rules for the output you get when you perform operations between n-dimensional arrays (which could be panels, dataframes or series) or between an array and a scalar value.

Broadcasting using a scalar value

So the simplest case is just multiplying by a scalar value:

In [4]:
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))
s

Out[4]:
0    0
1    1
2    2
3    3
4    4
dtype: int32

In [5]:    
s * 10

Out[5]:
0     0
1    10
2    20
3    30
4    40
dtype: int32

We get the same expected result with a dataframe:

In [6]:    
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4)})
df

Out[6]:
          a         b
0  0.216920  0.652193
1  0.968969  0.033369
2  0.637784  0.856836
3 -2.303556  0.426238

In [7]:    
df * 10

Out[7]:
           a         b
0   2.169204  6.521925
1   9.689690  0.333695
2   6.377839  8.568362
3 -23.035557  4.262381

What is technically happening here is that the scalar value has been broadcast to the same dimensions as the Series and DataFrame above, and the operation is then applied element by element.
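To make the scalar case concrete, here is a minimal sketch that spells out the stretching step with `np.broadcast_to`. NumPy and pandas do not necessarily materialise such an array internally; the example only illustrates that the result is the same as if they did.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# Conceptually, the scalar 10 is stretched to the shape of the Series ...
stretched = np.broadcast_to(10, s.shape)  # a read-only view: [10 10 10 10 10]

# ... and the multiplication is then element-wise.
explicit = s * stretched
implicit = s * 10

print(explicit.equals(implicit))  # True
```

The explicit and implicit forms give identical results; broadcasting simply saves you from building the stretched array yourself.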

Broadcasting using a 1-D array

Say we have a 2-D dataframe of shape 4 x 3 (4 rows x 3 columns). We can perform an operation along the x-axis (across the columns) by using a 1-D Series whose length equals the row length, i.e. the number of columns:

In [8]:
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4), 'c':np.random.randn(4)})
df

Out[8]:
          a         b         c
0  0.122073 -1.178127 -1.531254
1  0.011346 -0.747583 -1.967079
2 -0.019716 -0.235676  1.419547
3  0.215847  1.112350  0.659432

In [26]:    
df.iloc[0]

Out[26]:
a    0.122073
b   -1.178127
c   -1.531254
Name: 0, dtype: float64

In [27]:    
df + df.iloc[0]

Out[27]:
          a         b         c
0  0.244146 -2.356254 -3.062507
1  0.133419 -1.925710 -3.498333
2  0.102357 -1.413803 -0.111707
3  0.337920 -0.065777 -0.871822

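The above looks funny at first until you understand what is happening: I took the first row of values and added it to every row of the df. Each value in the row lines up with the column of the same label. It can be visualised using this pic (sourced from scipy):

[Figure: broadcasting diagram from the scipy documentation]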

The general rule is this:

In order to broadcast, the size of the trailing axes for both arrays in an operation must either be the same size or one of them must be one.
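The rule can be sketched as a small shape check. The helper `broadcast_shape` below is a hypothetical name written for illustration, not a NumPy or pandas function; it compares two shapes from the trailing axis backwards and treats missing leading axes as size 1.

```python
import numpy as np


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible.

    Hypothetical helper for illustration only.
    """
    result = []
    # Walk both shapes from the trailing axis backwards;
    # an axis that is missing on one side counts as size 1.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        a = shape_a[-i] if i <= len(shape_a) else 1
        b = shape_b[-i] if i <= len(shape_b) else 1
        if a != b and a != 1 and b != 1:
            raise ValueError(f"axes of size {a} and {b} cannot be broadcast")
        result.append(b if a == 1 else a)
    return tuple(reversed(result))


print(broadcast_shape((4, 3), (3,)))  # (4, 3)
print(broadcast_shape((4, 1), (3,)))  # (4, 3)

# NumPy agrees with the hand-written rule.
print(np.broadcast(np.empty((4, 1)), np.empty(3)).shape)  # (4, 3)

try:
    broadcast_shape((4, 3), (4,))
except ValueError as exc:
    print("incompatible:", exc)
```

The last call fails because the trailing axes are 3 and 4, and neither is 1.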

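So what happens if I try to add a 1-D Series that doesn't match in length, say one with 4 elements? Numpy would raise a ValueError for mismatched arrays, but pandas first aligns the Series index (0 to 3) against the column labels (a, b, c). No labels match, so you get a df full of NaN values: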

In [30]:
df + pd.Series(np.arange(4))

Out[30]:
    a   b   c   0   1   2   3
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
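The difference between label alignment and positional broadcasting can be sketched as follows. The frame uses fixed values rather than random ones so the results are predictable; `DataFrame.add` with `axis=0` is shown as one way to align the Series on the row index instead of the columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=list("abc"))
s = pd.Series(np.arange(4))

# A Series is aligned on labels: its index (0..3) is matched against the
# column labels ('a', 'b', 'c'); nothing matches, so every cell is NaN.
aligned = df + s
print(aligned.shape)  # (4, 7)

# A bare NumPy array has no labels, so pandas applies positional rules
# and rejects the length mismatch.
try:
    df + np.arange(4)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# To add the Series down the rows instead, align it on the row index.
by_rows = df.add(s, axis=0)
print(by_rows.loc[3, "a"])  # 12.0
```

With `axis=0` the Series index is matched against the row labels 0 to 3, so row `i` has `i` added to each of its values.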

One of the great things about pandas is that it tries to align on existing column names and row labels, but this can get in the way when you try to perform a fancier broadcast like this:

In [55]:
df[['a']] + df.iloc[0]

Out[55]:
          a   b   c
0  0.244146 NaN NaN
1  0.133419 NaN NaN
2  0.102357 NaN NaN
3  0.337920 NaN NaN

In the above I use double subscripting to force the shape to be (4,1), but we hit a problem when trying to broadcast using the first row: label alignment only finds a match on the first column, so the others become NaN. To get the same form of broadcasting as the diagram above shows, we have to drop down to numpy arrays, which are anonymous data with no labels to align on:

In [56]:
df[['a']].values + df.iloc[0].values

Out[56]:
array([[ 0.24414608, -1.05605392, -1.4091805 ],
       [ 0.13341899, -1.166781  , -1.51990758],
       [ 0.10235701, -1.19784299, -1.55096957],
       [ 0.33792013, -0.96227987, -1.31540645]])

It's also possible to broadcast in 3 dimensions. I don't go near that stuff often, but the numpy, scipy and pandas books have examples that show how that works.

Generally speaking, the thing to remember is that scalar values are the simple case; for n-D arrays, the lengths of the minor/trailing axes must match or one of them must be 1.
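As a small illustration of the 3-dimensional case, the sketch below uses plain numpy arrays with made-up shapes. The same trailing-axis rule applies: shapes are compared from the last axis backwards.

```python
import numpy as np

cube = np.zeros((2, 4, 3))            # e.g. 2 "pages" of a 4 x 3 table
row = np.array([1.0, 2.0, 3.0])       # shape (3,)
col = np.arange(4.0).reshape(4, 1)    # shape (4, 1)

# (2, 4, 3) with (3,): the trailing axes match, so the row is added to
# every row of every page.
print((cube + row).shape)  # (2, 4, 3)

# (2, 4, 3) with (4, 1): trailing axis is 1, next axis matches, so the
# column is stretched across the 3 columns of every page.
print((cube + col).shape)  # (2, 4, 3)

# (2, 4, 3) with (4,): trailing axes are 3 and 4, so this fails.
try:
    cube + np.arange(4.0)
except ValueError as exc:
    print("not broadcastable:", exc)
```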

Update

It seems that in pandas 0.20.2 mixing the dataframe directly with a numpy array of a different length now leads to ValueError: Unable to coerce to Series, length must be 1: given 3.

So you have to call .values on the df first, so that both operands are plain numpy arrays:

In[42]:
df[['a']].values + df.iloc[0].values

Out[42]: 
array([[ 0.244146, -1.056054, -1.409181],
       [ 0.133419, -1.166781, -1.519908],
       [ 0.102357, -1.197843, -1.55097 ],
       [ 0.33792 , -0.96228 , -1.315407]])

To turn this back into a df with the original labels, we can construct a df from the np array and pass the original columns as an argument to the constructor:

In[43]:
pd.DataFrame(df[['a']].values + df.iloc[0].values, columns=df.columns)

Out[43]: 
          a         b         c
0  0.244146 -1.056054 -1.409181
1  0.133419 -1.166781 -1.519908
2  0.102357 -1.197843 -1.550970
3  0.337920 -0.962280 -1.315407
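A variation on the same idea, sketched under two assumptions that go beyond the original answer: it uses `.to_numpy()` (available in newer pandas releases than the 0.20.2 mentioned above) in place of `.values`, and it passes `index=` as well so that a non-default row index survives the round trip. The frame and its labels are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12.0).reshape(4, 3),
    columns=list("abc"),
    index=list("wxyz"),  # a non-default row index, to show it is preserved
)

col = df[["a"]].to_numpy()   # shape (4, 1), labels dropped
row = df.iloc[0].to_numpy()  # shape (3,), labels dropped

# Plain numpy broadcasting: (4, 1) with (3,) gives (4, 3).
# Then re-attach both sets of labels.
result = pd.DataFrame(col + row, index=df.index, columns=df.columns)

print(result.loc["x", "b"])  # 4.0
```

Passing only `columns=` would leave the result with a fresh 0..3 row index, which is harmless for the example in the answer but would lose custom row labels.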
