根据列标签重塑 pandas 中的数据框 [英] Reshaping dataframes in pandas based on column labels
问题描述
在熊猫中重塑以下数据框的最佳方法是什么?每个样本的此DataFrame df
具有x,y
值(在这种情况下为s1
和s2
),看起来像这样:
What is the best way to reshape the following dataframe in pandas? This DataFrame df
has x,y
values for each sample (s1
and s2
in this case) and looks like this:
In [23]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})
In [24]: df
Out[24]:
s1_x s1_y s2_x s2_y
0 0.913462 0.525590 -0.377640 0.700720
1 0.723288 -0.691715 0.127153 0.180836
2 0.181631 -1.090529 -1.392552 1.530669
3 0.997414 -1.486094 1.207012 0.376120
4 -0.319841 0.195289 -1.034683 0.286073
5 1.085154 -0.619635 0.396867 0.623482
6 1.867816 -0.928101 -0.491929 -0.955295
7 0.920658 -1.132057 1.701582 -0.110299
8 -0.241853 -0.129702 -0.809852 0.014802
9 -0.019523 -0.578930 0.803688 -0.881875
s1_x
和s1_y
是样本1的x/y值,s2_x, s2_y
是样本2的样本值,依此类推.如何将其重塑为仅包含x
,y
的DataFrame列,但包含一个附加列sample
,该列对DataFrame中的每一行说它是来自s1
还是s2
?例如
s1_x
and s1_y
are the x/y values for sample 1, s2_x, s2_y
are the sample values for sample 2, etc. How can this be reshaped into a DataFrame containing only x
, y
columns but that contains an additional column sample
that says for each row in the DataFrame whether it's from s1
or s2
? E.g.
x y sample
0 0.913462 0.525590 s1
1 0.723288 -0.691715 s1
2 0.181631 -1.090529 s1
3 0.997414 -1.486094 s1
...
5 0.396867 0.623482 s2
...
这对于以后使用Rpy2进行绘图很有用,因为许多R绘图功能都可以利用此分组变量,因此这就是我重塑数据框的动机.
This is useful for plotting things with Rpy2 later on, since many R plotting features can make use of this grouping variable, so that's my motivation for reshaping the dataframe.
我认为,Chang She给出的答案不会转换为具有唯一索引的数据帧,例如:
I think the answer given by Chang She doesn't translate to dataframes that have a unique index, like this one:
In [636]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10), "names": range(10)})
In [637]: df
Out[637]:
names s1_x s1_y s2_x s2_y
0 0 0.672298 0.415366 1.034770 0.556209
1 1 0.067087 -0.851028 0.053608 -0.276461
2 2 -0.674174 -0.099015 0.864148 -0.067240
3 3 0.542996 -0.813018 2.283530 2.793727
4 4 0.216633 -0.091870 -0.746411 -0.421852
5 5 0.141301 -1.537721 -0.371601 -1.594634
6 6 1.267148 -0.833120 0.369516 -0.671627
7 7 -0.231163 -0.557398 1.123155 0.865140
8 8 1.790570 -0.428563 0.668987 0.632409
9 9 -0.820315 -0.894855 0.673247 -1.195831
In [638]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [639]: df.stack(0).reset_index(1)
Out[639]:
level_1 x y
0 s1 0.672298 0.415366
0 s2 1.034770 0.556209
1 s1 0.067087 -0.851028
1 s2 0.053608 -0.276461
2 s1 -0.674174 -0.099015
2 s2 0.864148 -0.067240
3 s1 0.542996 -0.813018
3 s2 2.283530 2.793727
4 s1 0.216633 -0.091870
4 s2 -0.746411 -0.421852
5 s1 0.141301 -1.537721
5 s2 -0.371601 -1.594634
6 s1 1.267148 -0.833120
6 s2 0.369516 -0.671627
7 s1 -0.231163 -0.557398
7 s2 1.123155 0.865140
8 s1 1.790570 -0.428563
8 s2 0.668987 0.632409
9 s1 -0.820315 -0.894855
9 s2 0.673247 -1.195831
转换有效,但是在此过程中,列"names"
丢失了.我如何将"names"
列保留在df中,同时仍对名称中具有_
的列进行熔化转换? "names"
列只是为数据框中的每一行分配一个唯一的名称.例如,这里是数字,但是在我的数据中,它们是字符串标识符.
The transformation worked but in the process the column "names"
was lost. How can I keep the "names"
column in the df while still doing the melting transformation on the columns that have _
in their names? The "names"
column just assigns a unique name to each row in the dataframe. It's numeric here for example but in my data they are string identifiers.
谢谢.
推荐答案
我假设您已经有了DataFrame.在这种情况下,您可以将列转换为MultiIndex并使用堆栈,然后使用reset_index.请注意,然后您将不得不对列进行重命名和重新排序,并按示例进行排序,以准确获得 您在问题中发布的内容:
I'm assuming you already have the DataFrame. In which case you can just turn the columns into a MultiIndex and use stack then reset_index. Note that you'll then have to rename and reorder the columns and sort by sample to get exactly what you posted in the question:
In [4]: df = pandas.DataFrame({"s1_x": scipy.randn(10), "s1_y": scipy.randn(10), "s2_x": scipy.randn(10), "s2_y": scipy.randn(10)})
In [5]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [6]: df.stack(0).reset_index(1)
Out[6]:
level_1 x y
0 s1 0.897994 -0.278357
0 s2 -0.008126 -1.701865
1 s1 -1.354633 -0.890960
1 s2 -0.773428 0.003501
2 s1 -1.499422 -1.518993
2 s2 0.240226 1.773427
3 s1 -1.090921 0.847064
3 s2 -1.061303 1.557871
4 s1 -1.697340 -0.160952
4 s2 -0.930642 0.182060
5 s1 -0.356076 -0.661811
5 s2 0.539875 -1.033523
6 s1 -0.687861 -1.450762
6 s2 0.700193 0.658959
7 s1 -0.130422 -0.826465
7 s2 -0.423473 -1.281856
8 s1 0.306983 0.433856
8 s2 0.097279 -0.256159
9 s1 0.498057 0.147243
9 s2 1.312578 0.111837
如果仅可以使用MultiIndex创建DataFrame,则可以保存MultiIndex转换.
You can save the MultiIndex conversion if you can just create the DataFrame with a MultiIndex instead.
使用合并将原始ID重新加入
use merge to join original ids back in
In [59]: df
Out[59]:
names s1_x s1_y s2_x s2_y
0 0 0.732099 0.018387 0.299856 0.737142
1 1 0.914755 -0.798159 -0.732868 -1.279311
2 2 -1.063558 0.161779 -0.115751 -0.251157
3 3 -1.185501 0.095147 -1.343139 -0.003084
4 4 0.622400 -0.299726 0.198710 -0.383060
5 5 0.179318 0.066029 -0.635507 1.366786
6 6 -0.820099 0.066067 1.113402 0.002872
7 7 0.711627 -0.182925 1.391194 -2.788434
8 8 -1.124092 1.303375 0.202691 -0.225993
9 9 -0.179026 0.847466 -1.480708 -0.497067
In [60]: id = df.ix[:, ['names']]
In [61]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [62]: pandas.merge(df.stack(0).reset_index(1), id, left_index=True, right_index=True)
Out[62]:
level_1 x y names
0 s1 0.732099 0.018387 0
0 s2 0.299856 0.737142 0
1 s1 0.914755 -0.798159 1
1 s2 -0.732868 -1.279311 1
2 s1 -1.063558 0.161779 2
2 s2 -0.115751 -0.251157 2
3 s1 -1.185501 0.095147 3
3 s2 -1.343139 -0.003084 3
4 s1 0.622400 -0.299726 4
4 s2 0.198710 -0.383060 4
5 s1 0.179318 0.066029 5
5 s2 -0.635507 1.366786 5
6 s1 -0.820099 0.066067 6
6 s2 1.113402 0.002872 6
7 s1 0.711627 -0.182925 7
7 s2 1.391194 -2.788434 7
8 s1 -1.124092 1.303375 8
8 s2 0.202691 -0.225993 8
9 s1 -0.179026 0.847466 9
9 s2 -1.480708 -0.497067 9
或者:
In [64]: df
Out[64]:
names s1_x s1_y s2_x s2_y
0 0 0.744742 -1.123403 0.212736 0.005440
1 1 0.465075 -0.673491 1.467156 -0.176298
2 2 -1.111566 0.168043 -0.102142 -1.072461
3 3 1.226537 -1.147357 -1.583762 -1.236582
4 4 1.137675 0.224422 0.738988 1.528416
5 5 -0.237014 -1.110303 -0.770221 1.389714
6 6 -0.659213 2.305374 -0.326253 1.416778
7 7 1.524214 -0.395451 -1.884197 0.524606
8 8 0.375112 -0.622555 0.295336 0.927208
9 9 1.168386 -0.291899 -1.462098 0.250889
In [65]: df = df.set_index('names')
In [66]: df.columns = pandas.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
In [67]: df.stack(0).reset_index(1)
Out[67]:
level_1 x y
names
0 s1 0.744742 -1.123403
0 s2 0.212736 0.005440
1 s1 0.465075 -0.673491
1 s2 1.467156 -0.176298
2 s1 -1.111566 0.168043
2 s2 -0.102142 -1.072461
3 s1 1.226537 -1.147357
3 s2 -1.583762 -1.236582
4 s1 1.137675 0.224422
4 s2 0.738988 1.528416
5 s1 -0.237014 -1.110303
5 s2 -0.770221 1.389714
6 s1 -0.659213 2.305374
6 s2 -0.326253 1.416778
7 s1 1.524214 -0.395451
7 s2 -1.884197 0.524606
8 s1 0.375112 -0.622555
8 s2 0.295336 0.927208
9 s1 1.168386 -0.291899
9 s2 -1.462098 0.250889
这篇关于根据列标签重塑 pandas 中的数据框的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!