How do I operate on a DataFrame with a Series for every column
Problem description
I've seen this kind of question several times over and have seen many other questions that involve some element of this. Most recently, I had to spend a bit of time explaining this concept in comments while looking for an appropriate canonical Q&A. I did not find one and so I thought I'd write one.
This question usually arises with respect to a specific operation but equally applies to most arithmetic operations.
- How do I subtract a Series from every column in a DataFrame?
- How do I add a Series to every column in a DataFrame?
- How do I multiply every column in a DataFrame by a Series?
- How do I divide every column in a DataFrame by a Series?
Given a Series s and a DataFrame df, how do I operate on each column of df with s?
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=[0, 1],
    columns=['a', 'b', 'c']
)
s = pd.Series([3, 14], index=[0, 1])
When I attempt to add them, I get all np.nan:
df + s
    a   b   c   0   1
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
What I thought I should get:
    a   b   c
0   4   5   6
1  18  19  20
Recommended answer
Please bear with the preamble. It's important to address some higher-level concepts first. Since my motivation is to share knowledge and teach, I wanted to make this as clear as possible.
It is helpful to create a mental model of what Series and DataFrame objects are.
A Series should be thought of as an enhanced dictionary. This isn't always a perfect analogy, but we'll start here. There are other analogies that you can make, but I am targeting a dictionary in order to demonstrate the purpose of this post.
The index holds the keys that we can reference to get at the corresponding values. When the elements of the index are unique, the comparison to a dictionary becomes very close.
The values are the corresponding data keyed by the index.
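To make the dictionary analogy concrete, here is a minimal sketch. The names prices and as_dict and their contents are made up for illustration and are not part of the sample data used later.

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
prices = pd.Series([10, 11, 12], index=['a', 'b', 'c'])
as_dict = {'a': 10, 'b': 11, 'c': 12}

# Looking up a value by its key works the same way on both objects.
assert prices['b'] == as_dict['b']

# The index plays the role of the dict keys, the values that of the dict values.
assert list(prices.index) == list(as_dict.keys())
assert list(prices.values) == list(as_dict.values())

# A Series can be built from a dict and converted back to one.
assert pd.Series(as_dict).to_dict() == as_dict
```

The analogy weakens once the index contains duplicate labels, since a dict cannot hold the same key twice.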
A DataFrame should be thought of as a dictionary of Series or a Series of Series. In this case the keys are the column names and the values are the columns themselves as Series objects. Each Series agrees to share the same index, which is the index of the DataFrame.
The columns are the keys that we can reference to get at the corresponding Series.
The index is the index that all of the Series values agree to share.
The columns and the index are the same kind of thing. A DataFrame's index can be used as another DataFrame's columns. In fact, this happens when you do df.T to get a transpose.
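A small sketch of that last point, using a made-up frame with named axes so the swap is easy to see (frame, 'rows' and 'cols' are illustrative names only):

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=pd.Index(['x', 'y'], name='rows'),
    columns=pd.Index(['a', 'b', 'c'], name='cols'),
)
transposed = frame.T

# After transposing, the two Index objects have simply swapped roles.
assert transposed.index.equals(frame.columns)
assert transposed.columns.equals(frame.index)

# The axis names travel with the Index objects.
assert transposed.index.name == 'cols'
assert transposed.columns.name == 'rows'
```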
The values attribute is a 2-dimensional array that contains the data in a DataFrame. In reality, values is NOT what is stored inside the DataFrame object. (Well, sometimes it is, but I'm not about to try to describe the block manager.) The point is, it is better to think of this as access to a 2-dimensional array of the data.
These are sample pandas.Index objects that can be used as the index of a Series or DataFrame, or as the columns of a DataFrame:
import numpy as np
import pandas as pd
idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')
These are sample pandas.Series objects that use the pandas.Index objects above:
s0 = pd.Series(range(10, 15), idx_lower)
s1 = pd.Series(range(30, 40, 2), idx_lower)
s2 = pd.Series(range(50, 10, -8), idx_range)
These are sample pandas.DataFrame objects that use the pandas.Index objects above:
df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower)
df1 = pd.DataFrame(
    # np.prod replaces np.product, which was removed in NumPy 2.0
    np.arange(np.prod(df0.shape)).reshape(df0.shape),
    index=idx_range, columns=idx_lower
)
Series on Series
When operating on two Series, the alignment is obvious. You align the index of one Series with the index of the other.
s1 + s0
lower
a 40
b 43
c 46
d 49
e 52
dtype: int64
The result is the same when I randomly shuffle one of them before I operate. The indices will still align.
s1 + s0.sample(frac=1)
lower
a 40
b 43
c 46
d 49
e 52
dtype: int64
That is NOT the case when I instead operate with the values of the shuffled Series. In this case, Pandas doesn't have the index to align with and therefore operates by position.
s1 + s0.sample(frac=1).values
lower
a 42
b 42
c 47
d 50
e 49
dtype: int64
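Because sample(frac=1) gives a different order on every run, the contrast is easier to reproduce with a fixed reordering. The following is a sketch on made-up three-element Series (left, right and the reversal via iloc are my choices for illustration, not part of the sample data above):

```python
import pandas as pd

left = pd.Series([30, 32, 34], index=['a', 'b', 'c'])
right = pd.Series([10, 11, 12], index=['a', 'b', 'c'])

# Reverse the row order; every label keeps its own value.
reversed_right = right.iloc[::-1]

# Series + Series aligns on labels, so the row order does not matter.
by_label = left + reversed_right

# Series + NumPy array has no labels to align on, so it adds by position.
by_position = left + reversed_right.to_numpy()

print(by_label.tolist())     # [40, 43, 46]
print(by_position.tolist())  # [42, 43, 44]
```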
Add a scalar
s1 + 1
lower
a 31
b 33
c 35
d 37
e 39
dtype: int64
DataFrame on DataFrame
The same is true when operating between two DataFrames. The alignment is obvious and does what we think it should do.
df0 + df1
lower a b c d e
range
0 100 101 102 103 104
1 105 106 107 108 109
2 110 111 112 113 114
3 115 116 117 118 119
4 120 121 122 123 124
Shuffle the second DataFrame on both axes. The index and columns will still align and give us the same thing.
df0 + df1.sample(frac=1).sample(frac=1, axis=1)
lower a b c d e
range
0 100 101 102 103 104
1 105 106 107 108 109
2 110 111 112 113 114
3 115 116 117 118 119
4 120 121 122 123 124
Same shuffling, but add the array and not the DataFrame. The data is no longer aligned and we get different results.
df0 + df1.sample(frac=1).sample(frac=1, axis=1).values
lower a b c d e
range
0 123 124 121 122 120
1 118 119 116 117 115
2 108 109 106 107 105
3 103 104 101 102 100
4 113 114 111 112 110
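The same contrast can be reproduced without randomness. This sketch uses two small made-up frames (ones and counts) and reverses both axes instead of shuffling them:

```python
import numpy as np
import pandas as pd

ones = pd.DataFrame(100, index=[0, 1], columns=['a', 'b'])
counts = pd.DataFrame(np.arange(4).reshape(2, 2), index=[0, 1], columns=['a', 'b'])

# Reverse both axes; each cell keeps its row and column labels.
flipped = counts.iloc[::-1, ::-1]

# DataFrame + DataFrame aligns on both index and columns.
aligned = ones + flipped

# DataFrame + 2-D array adds cell by cell, ignoring labels.
positional = ones + flipped.to_numpy()

print(aligned.to_numpy().tolist())     # [[100, 101], [102, 103]]
print(positional.to_numpy().tolist())  # [[103, 102], [101, 100]]
```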
Add a 1-dimensional array. It will align with the columns and broadcast across the rows.
df0 + [*range(2, df0.shape[1] + 2)]
lower a b c d e
range
0 102 103 104 105 106
1 102 103 104 105 106
2 102 103 104 105 106
3 102 103 104 105 106
4 102 103 104 105 106
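One way to read this behaviour: a list with one entry per column acts like a Series whose index is the column labels. A minimal sketch on a made-up frame (frame, from_list and from_series are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame(100, index=range(3), columns=['a', 'b'])

# A list with one entry per column is added to every row...
from_list = frame + [1, 2]

# ...which matches adding a Series keyed by the column labels.
from_series = frame + pd.Series([1, 2], index=frame.columns)

assert from_list.equals(from_series)
print(from_list.to_numpy().tolist())  # [[101, 102], [101, 102], [101, 102]]
```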
Add a scalar. There is nothing to align with, so it broadcasts to everything.
df0 + 1
lower a b c d e
range
0 101 101 101 101 101
1 101 101 101 101 101
2 101 101 101 101 101
3 101 101 101 101 101
4 101 101 101 101 101
DataFrame on Series
If DataFrames are to be thought of as dictionaries of Series, and Series are to be thought of as dictionaries of values, then it is natural that when operating between a DataFrame and a Series, they should be aligned by their "keys".
s0:
lower   a   b   c   d   e
       10  11  12  13  14
df0:
lower    a    b    c    d    e
range
0      100  100  100  100  100
1      100  100  100  100  100
2      100  100  100  100  100
3      100  100  100  100  100
4      100  100  100  100  100
And when we operate, the 10 in s0['a'] gets added to the entire column df0['a'].
df0 + s0
lower a b c d e
range
0 110 111 112 113 114
1 110 111 112 113 114
2 110 111 112 113 114
3 110 111 112 113 114
4 110 111 112 113 114
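To see that the matching really happens key by key, it helps to try a Series whose keys only partly overlap the columns. This is a sketch on made-up data (frame and offsets are hypothetical and not part of the samples above):

```python
import pandas as pd

frame = pd.DataFrame(100, index=[0, 1], columns=['a', 'b', 'c'])
# Hypothetical Series that shares only the keys 'b' and 'c' with the columns.
offsets = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

result = frame + offsets

# The result columns are the union of both sets of keys.
print(list(result.columns))      # ['a', 'b', 'c', 'd']
# Matching keys are added down the whole column...
print(result['b'].tolist())      # [101.0, 101.0]
# ...and keys present on only one side become NaN.
print(result['a'].isna().all())  # True
```

The values come back as floats because NaN has to be representable in the result.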
The crux of the question and the point of this post
What about when I want to operate with s2 and df0?
s2:               df0:
             |    lower    a    b    c    d    e
range        |    range
0      50    |    0      100  100  100  100  100
1      42    |    1      100  100  100  100  100
2      34    |    2      100  100  100  100  100
3      26    |    3      100  100  100  100  100
4      18    |    4      100  100  100  100  100
When I operate, I get all np.nan, as cited in the question.
df0 + s2
        a   b   c   d   e   0   1   2   3   4
range
0     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
This does not produce what we wanted, because Pandas is aligning the index of s2 with the columns of df0. The columns of the result are the union of the index of s2 and the columns of df0.
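The same explanation can be checked against the small df and s from the question. This is only a sketch of the reasoning; the variable result is mine:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=[0, 1], columns=['a', 'b', 'c'])
s = pd.Series([3, 14], index=[0, 1])

result = df + s

# No column label of df matches an index label of s, so nothing lines up.
assert set(df.columns).isdisjoint(s.index)

# The result columns are the union of the two sets of labels...
assert set(result.columns) == set(df.columns) | set(s.index)

# ...and every cell is NaN because no label had a partner.
assert result.isna().all().all()
```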
We could fake it out with a tricky transposition:
(df0.T + s2).T
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
But it turns out Pandas has a better solution. There are operation methods that allow us to pass an axis argument to specify the axis to align with.
operator  method
-         sub
+         add
*         mul
/         div
**        pow
So the answer is simply:
df0.add(s2, axis='index')
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
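Applied to the df and s from the question, the same call should give the result that was expected there. A short sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=[0, 1], columns=['a', 'b', 'c'])
s = pd.Series([3, 14], index=[0, 1])

# Align s with the row labels of df rather than with its columns.
result = df.add(s, axis='index')

print(result.to_numpy().tolist())  # [[4, 5, 6], [18, 19, 20]]
```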
It turns out axis='index' is synonymous with axis=0, just as axis='columns' is synonymous with axis=1.
df0.add(s2, axis=0)
lower a b c d e
range
0 150 150 150 150 150
1 142 142 142 142 142
2 134 134 134 134 134
3 126 126 126 126 126
4 118 118 118 118 118
The rest of the operations
df0.sub(s2, axis=0)
lower a b c d e
range
0 50 50 50 50 50
1 58 58 58 58 58
2 66 66 66 66 66
3 74 74 74 74 74
4 82 82 82 82 82
df0.mul(s2, axis=0)
lower a b c d e
range
0 5000 5000 5000 5000 5000
1 4200 4200 4200 4200 4200
2 3400 3400 3400 3400 3400
3 2600 2600 2600 2600 2600
4 1800 1800 1800 1800 1800
df0.div(s2, axis=0)
lower a b c d e
range
0 2.000000 2.000000 2.000000 2.000000 2.000000
1 2.380952 2.380952 2.380952 2.380952 2.380952
2 2.941176 2.941176 2.941176 2.941176 2.941176
3 3.846154 3.846154 3.846154 3.846154 3.846154
4 5.555556 5.555556 5.555556 5.555556 5.555556
df0.pow(1 / s2, axis=0)
lower a b c d e
range
0 1.096478 1.096478 1.096478 1.096478 1.096478
1 1.115884 1.115884 1.115884 1.115884 1.115884
2 1.145048 1.145048 1.145048 1.145048 1.145048
3 1.193777 1.193777 1.193777 1.193777 1.193777
4 1.291550 1.291550 1.291550 1.291550 1.291550
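As a sanity check, each of these methods with axis=0 should agree with the transposition trick shown earlier. The sketch below compares the two on a small made-up frame; frame, per_row and the pairs mapping are illustrative names, and operator.truediv stands in for the / operator:

```python
import operator

import pandas as pd

frame = pd.DataFrame(100, index=range(3), columns=['a', 'b'])
per_row = pd.Series([2, 4, 5], index=range(3))

# Each method name paired with the operator it corresponds to.
pairs = {
    'sub': operator.sub,
    'add': operator.add,
    'mul': operator.mul,
    'div': operator.truediv,
    'pow': operator.pow,
}

matches = {}
for method_name, op in pairs.items():
    via_method = getattr(frame, method_name)(per_row, axis=0)
    via_transpose = op(frame.T, per_row).T
    matches[method_name] = via_method.equals(via_transpose)

assert all(matches.values())
```

The method form is usually preferable, since it states the alignment axis directly instead of transposing the data twice.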