How do I operate on a DataFrame with a Series for every column?

Problem description

Given a Series s and DataFrame df, how do I operate on each column of df with s?

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=[0, 1],
    columns=['a', 'b', 'c']
)

s = pd.Series([3, 14], index=[0, 1])

When I attempt to add them, I get all np.nan:

df + s

    a   b   c   0   1
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN

What I thought I should get is:

    a   b   c
0   4   5   6
1  18  19  20

Objective and motivation

I've seen this kind of question several times over and have seen many other questions that involve some element of it. Most recently, I had to spend a bit of time explaining this concept in comments while looking for an appropriate canonical Q&A. I did not find one, so I thought I'd write one.

The question usually arises with respect to a specific operation, but the answer applies equally to most arithmetic operations.

  • How do I subtract a Series from every column in a DataFrame?
  • How do I add a Series to every column in a DataFrame?
  • How do I multiply every column in a DataFrame by a Series?
  • How do I divide every column in a DataFrame by a Series?

Recommended answer

It is helpful to create a mental model of what Series and DataFrame objects are.

A Series should be thought of as an enhanced dictionary. This isn't always a perfect analogy, but we'll start here. There are other analogies you could make, but I am targeting a dictionary because it best serves the purpose of this post.

The index of a Series holds the keys that we can reference to get at the corresponding values. When the elements of the index are unique, the comparison to a dictionary becomes very close.

The values of a Series are the corresponding values that are keyed by the index.
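To make the dictionary analogy concrete, here is a minimal sketch. The Series `prices` and its labels are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the analogy.
prices = pd.Series([10, 11, 12], index=['a', 'b', 'c'])

# Look a value up by its key, just like a dictionary.
value_b = prices['b']

# Membership tests check the keys (the index), as they do for a dictionary.
has_b = 'b' in prices

# The index plays the role of the keys; tolist() returns the plain values.
keys = list(prices.index)
vals = prices.tolist()

# With a unique index the Series converts cleanly to a real dictionary.
as_dict = prices.to_dict()

print(value_b, has_b)
print(keys, vals)
print(as_dict)
```

The analogy weakens once the index contains duplicates, because a dictionary cannot hold the same key twice.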

A DataFrame should be thought of as a dictionary of Series or a Series of Series. In this case the keys are the column names and the values are the columns themselves as Series objects. Each Series agrees to share the same index, which is the index of the DataFrame.

The columns of a DataFrame are the keys that we can reference to get at the corresponding Series.

The index of a DataFrame is the index that all of the Series values agree to share.

The index and the columns are the same kind of thing. A DataFrame's index can be used as another DataFrame's columns. In fact, this happens when you do df.T to get a transpose.

The values attribute is a two-dimensional array that contains the data in a DataFrame. The reality is that values is not what is stored inside the DataFrame object. (Well, sometimes it is, but I'm not about to try to describe the block manager.) The point is, it is better to think of this as access to a two-dimensional array of the data.
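The following sketch walks through that anatomy on a small frame. The name `frame` and its contents are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical frame, for illustration only.
frame = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]}, index=[0, 1])

# Looking up a column name returns that column as a Series...
col_a = frame['a']
# ...and every column shares the frame's index.
shares_index = col_a.index.equals(frame.index)

# Transposing swaps the roles of index and columns.
transposed = frame.T
swapped = (transposed.index.equals(frame.columns)
           and transposed.columns.equals(frame.index))

# values gives access to the data as a two-dimensional array.
array = frame.values

print(type(col_a).__name__, shares_index, swapped)
print(array.shape)
```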

These are sample pandas.Index objects that can be used as the index of a Series or DataFrame, or as the columns of a DataFrame:

idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')

These are sample pandas.Series objects that use the pandas.Index objects above:

s0 = pd.Series(range(10, 15), idx_lower)
s1 = pd.Series(range(30, 40, 2), idx_lower)
s2 = pd.Series(range(50, 10, -8), idx_range)

These are sample pandas.DataFrame objects that use the pandas.Index objects above:

df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower)
df1 = pd.DataFrame(
    np.arange(np.prod(df0.shape)).reshape(df0.shape),
    index=idx_range, columns=idx_lower
)


Series on Series

When operating on two Series, the alignment is obvious. You align the index of one Series with the index of the other.

s1 + s0

lower
a    40
b    43
c    46
d    49
e    52
dtype: int64

The result is the same when I randomly shuffle one of them before I operate. The indices still align.

s1 + s0.sample(frac=1)

lower
a    40
b    43
c    46
d    49
e    52
dtype: int64

This is not the case when I instead operate with the values of the shuffled Series. In this case, Pandas doesn't have an index to align with and therefore operates by position.

s1 + s0.sample(frac=1).values

lower
a    42
b    42
c    47
d    50
e    49
dtype: int64
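Because sample(frac=1) is random, the positional output above changes from run to run. Here is a reproducible sketch of the same contrast that reverses s0 instead of shuffling it; the reversal is my substitution and not part of the original example.

```python
import pandas as pd

idx_lower = pd.Index([*'abcde'], name='lower')
s0 = pd.Series(range(10, 15), idx_lower)
s1 = pd.Series(range(30, 40, 2), idx_lower)

# Reverse s0 instead of shuffling it, so the outcome is reproducible.
s0_reversed = s0.iloc[::-1]

# Series + Series: labels are aligned, so the order of s0 does not matter.
aligned = s1 + s0_reversed

# Series + array: no labels to align on, so elements are paired by position.
positional = s1 + s0_reversed.values

print(aligned.tolist())     # [40, 43, 46, 49, 52]
print(positional.tolist())  # [44, 45, 46, 47, 48]
```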

Add a scalar:

s1 + 1

lower
a    31
b    33
c    35
d    37
e    39
dtype: int64


DataFrame on DataFrame

The same is true when operating between two DataFrames. The alignment is obvious and does what we think it should do:

df0 + df1

lower    a    b    c    d    e
range
0      100  101  102  103  104
1      105  106  107  108  109
2      110  111  112  113  114
3      115  116  117  118  119
4      120  121  122  123  124

Shuffle the second DataFrame on both axes. The index and columns still align and give us the same thing.

df0 + df1.sample(frac=1).sample(frac=1, axis=1)

lower    a    b    c    d    e
range
0      100  101  102  103  104
1      105  106  107  108  109
2      110  111  112  113  114
3      115  116  117  118  119
4      120  121  122  123  124

Do the same shuffling, but add the underlying array instead of the DataFrame. The data is no longer aligned and we get different results.

df0 + df1.sample(frac=1).sample(frac=1, axis=1).values

lower    a    b    c    d    e
range
0      123  124  121  122  120
1      118  119  116  117  115
2      108  109  106  107  105
3      103  104  101  102  100
4      113  114  111  112  110
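The shuffled output above depends on the random draw. As a reproducible sketch of the same point, the version below reverses both axes of df1 instead of sampling; the reversal is an assumption of mine, chosen only to make the result deterministic.

```python
import numpy as np
import pandas as pd

idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')
df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower)
df1 = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    index=idx_range, columns=idx_lower
)

# Reverse both axes instead of shuffling, so the outcome is reproducible.
df1_reversed = df1.iloc[::-1, ::-1]

# DataFrame + DataFrame: both index and columns are aligned by label.
aligned = df0 + df1_reversed
same_as_unshuffled = aligned.equals(df0 + df1)

# DataFrame + 2-D array: values are paired purely by position.
positional = df0 + df1_reversed.values
top_left = positional.iloc[0, 0]  # 100 + 24, the last cell of df1

print(same_as_unshuffled)
print(top_left)
```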

Add a one-dimensional array. It is matched by position against the columns and broadcast across the rows.

df0 + [*range(2, df0.shape[1] + 2)]

lower    a    b    c    d    e
range
0      102  103  104  105  106
1      102  103  104  105  106
2      102  103  104  105  106
3      102  103  104  105  106
4      102  103  104  105  106
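A small sketch of what matching by position means here. Since a list carries no labels, the only thing pandas can check is its length; the mismatched list below is a made-up input to show that.

```python
import pandas as pd

idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')
df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower)

# One value per column: element i is added to the i-th column on every row.
per_column = df0 + [2, 3, 4, 5, 6]

# A list has no labels, so only its length can be checked against the columns.
try:
    df0 + [2, 3, 4]
    length_mismatch_raised = False
except ValueError:
    length_mismatch_raised = True

print(per_column.iloc[3].tolist())
print(length_mismatch_raised)
```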

Add a scalar. There isn't anything to align with, so it broadcasts to everything:

df0 + 1

lower    a    b    c    d    e
range
0      101  101  101  101  101
1      101  101  101  101  101
2      101  101  101  101  101
3      101  101  101  101  101
4      101  101  101  101  101


DataFrame on Series

If DataFrames are to be thought of as dictionaries of Series and Series are to be thought of as dictionaries of values, then it is natural that when operating between a DataFrame and a Series, they should be aligned by their "keys".

s0:
lower    a    b    c    d    e
        10   11   12   13   14

df0:
lower    a    b    c    d    e
range
0      100  100  100  100  100
1      100  100  100  100  100
2      100  100  100  100  100
3      100  100  100  100  100
4      100  100  100  100  100

And when we operate, the 10 in s0['a'] gets added to the entire column of df0['a']:

df0 + s0

lower    a    b    c    d    e
range
0      110  111  112  113  114
1      110  111  112  113  114
2      110  111  112  113  114
3      110  111  112  113  114
4      110  111  112  113  114

The crux of the issue and point of the post

What if I want to operate on s2 and df0?

s2:               df0:

             |    lower    a    b    c    d    e
range        |    range
0      50    |    0      100  100  100  100  100
1      42    |    1      100  100  100  100  100
2      34    |    2      100  100  100  100  100
3      26    |    3      100  100  100  100  100
4      18    |    4      100  100  100  100  100

When I operate, I get all np.nan, as cited in the question:

df0 + s2

        a   b   c   d   e   0   1   2   3   4
range
0     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

This does not produce what we wanted, because Pandas is aligning the index of s2 with the columns of df0. The columns of the result are the union of the index of s2 and the columns of df0.
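The same behaviour can be checked on the small df and s from the question. This sketch only inspects the result; the variable names `result_labels` and `all_missing` are mine.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=[0, 1],
    columns=['a', 'b', 'c']
)
s = pd.Series([3, 14], index=[0, 1])

result = df + s

# The Series index (0, 1) was matched against the column labels (a, b, c),
# so the result carries the union of both sets of labels as its columns.
result_labels = set(result.columns)

# No label appears on both sides, so no cell has two operands to combine.
all_missing = bool(result.isna().all().all())

print(result_labels)
print(all_missing)
```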

We could fake it out with a tricky transposition:

(df0.T + s2).T

lower    a    b    c    d    e
range
0      150  150  150  150  150
1      142  142  142  142  142
2      134  134  134  134  134
3      126  126  126  126  126
4      118  118  118  118  118

But it turns out Pandas has a better solution. There are operation methods that allow us to pass an axis argument to specify the axis to align with.

  • -   sub
  • +   add
  • *   mul
  • /   div
  • **  pow

So the answer is simply:

df0.add(s2, axis='index')

lower    a    b    c    d    e
range
0      150  150  150  150  150
1      142  142  142  142  142
2      134  134  134  134  134
3      126  126  126  126  126
4      118  118  118  118  118
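Applied to the df and s from the question, the same call gives the result the question was hoping for. This is a minimal sketch using only the data defined there.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]],
    index=[0, 1],
    columns=['a', 'b', 'c']
)
s = pd.Series([3, 14], index=[0, 1])

# Align s with the row labels of df, then broadcast it across the columns.
result = df.add(s, axis='index')

print(result)
#     a   b   c
# 0   4   5   6
# 1  18  19  20
```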

It turns out axis='index' is synonymous with axis=0, and axis='columns' is synonymous with axis=1:

df0.add(s2, axis=0)

lower    a    b    c    d    e
range
0      150  150  150  150  150
1      142  142  142  142  142
2      134  134  134  134  134
3      126  126  126  126  126
4      118  118  118  118  118
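A quick sketch to confirm the synonyms, rebuilding the sample objects from earlier in the post. It also checks that axis='columns' reproduces what the plain + operator does, which is how I read the default alignment.

```python
import pandas as pd

idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')
s0 = pd.Series(range(10, 15), idx_lower)
s2 = pd.Series(range(50, 10, -8), idx_range)
df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower)

# Row alignment: the name and the number select the same axis.
rows_match = df0.add(s2, axis='index').equals(df0.add(s2, axis=0))

# Column alignment: 'columns' and 1 both match what the + operator does.
cols_match = (df0.add(s0, axis='columns').equals(df0 + s0)
              and df0.add(s0, axis=1).equals(df0 + s0))

print(rows_match, cols_match)
```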


The rest of the operations

df0.sub(s2, axis=0)

lower   a   b   c   d   e
range
0      50  50  50  50  50
1      58  58  58  58  58
2      66  66  66  66  66
3      74  74  74  74  74
4      82  82  82  82  82


df0.mul(s2, axis=0)

lower     a     b     c     d     e
range
0      5000  5000  5000  5000  5000
1      4200  4200  4200  4200  4200
2      3400  3400  3400  3400  3400
3      2600  2600  2600  2600  2600
4      1800  1800  1800  1800  1800


df0.div(s2, axis=0)

lower         a         b         c         d         e
range
0      2.000000  2.000000  2.000000  2.000000  2.000000
1      2.380952  2.380952  2.380952  2.380952  2.380952
2      2.941176  2.941176  2.941176  2.941176  2.941176
3      3.846154  3.846154  3.846154  3.846154  3.846154
4      5.555556  5.555556  5.555556  5.555556  5.555556


df0.pow(1 / s2, axis=0)

lower         a         b         c         d         e
range
0      1.096478  1.096478  1.096478  1.096478  1.096478
1      1.115884  1.115884  1.115884  1.115884  1.115884
2      1.145048  1.145048  1.145048  1.145048  1.145048
3      1.193777  1.193777  1.193777  1.193777  1.193777
4      1.291550  1.291550  1.291550  1.291550  1.291550
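As one practical use of these methods, here is a sketch that centers each row of df1 on its own mean. This use case is my illustration and does not appear in the original post; the names `row_means` and `centered` are hypothetical.

```python
import numpy as np
import pandas as pd

idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')
df1 = pd.DataFrame(
    np.arange(25).reshape(5, 5),
    index=idx_range, columns=idx_lower
)

# mean(axis=1) returns a Series indexed like the rows of df1.
row_means = df1.mean(axis=1)

# Align that Series with the row labels and subtract it from every column.
centered = df1.sub(row_means, axis=0)

print(centered.iloc[0].tolist())  # [-2.0, -1.0, 0.0, 1.0, 2.0]
```

Writing df1 - row_means instead would try to match the row labels against the column labels and produce the all-NaN result discussed earlier.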


It was important to address some higher-level concepts first. Since my motivation is to share knowledge and teach, I wanted to make this as clear as possible.
