Concatenate specific pairs of columns in a dataframe based on a reference dataframe with a different index
Problem description
My goal is to concatenate columns in a dataframe (Source), based on pairs that are described in a separate dataframe (Reference). The resulting dataframe should replace the column headers of the Source with the index labels of the Reference.
Dataframe used as the reference:
import pandas as pd

df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), ['H1', 'H2', 'H3'])
df1
      FIRST SECOND
H1    Alpha  Bravo
H2    Alpha  Delta
H3  Charlie  Delta
Dataframe used as the data source:
df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Charlie=['T', 'G'],
    Delta=['T', 'C']
), ['item-000', 'item-111'])
df2
         Alpha Bravo Charlie Delta
item-000     A     A       T     T
item-111     C     C       G     C
What the resulting dataframe should look like:
          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC
Question
Is this possible with the current setup, or would it be better to change the format the data is in?
Recommended answer
Solution
Using pd.get_dummies and pd.DataFrame.dot:
# Note: the level argument of sum was removed in pandas 2.0, so this line needs pandas < 2.0
df2.dot(pd.get_dummies(df1.stack()).T).sum(1, level=0)
          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC
Explanation
I know I want to use a dot product. The rule of matrix multiplication is that an n x k matrix multiplied by a k x m matrix results in an n x m matrix. Looking at the final result, I see ['item-000', 'item-111'] in the index; that is the n in my n x k matrix. I look at my preliminary dataframes: do I have one with ['item-000', 'item-111'] in either the columns or the index? I do!
df2
         Alpha Bravo Charlie Delta
item-000     A     A       T     T
item-111     C     C       G     C
That implies my k is ['Alpha', 'Bravo', 'Charlie', 'Delta']. OK, so now I must look for the k x m matrix. The only other dataframe I have is df1, and the things that look like ['Alpha', 'Bravo', 'Charlie', 'Delta'] are in its values, not in the columns or the index. So I must get them there. I decide to stack df1 and use pd.get_dummies.
pd.get_dummies(df1.stack())
           Alpha  Bravo  Charlie  Delta
H1 FIRST       1      0        0      0
   SECOND      0      1        0      0
H2 FIRST       1      0        0      0
   SECOND      0      0        0      1
H3 FIRST       0      0        1      0
   SECOND      0      0        0      1
And now I have ['Alpha', 'Bravo', 'Charlie', 'Delta'] in my columns! That's my k. But I need it in my index. No problem, use transpose.
pd.get_dummies(df1.stack()).T
           H1           H2           H3
        FIRST SECOND FIRST SECOND FIRST SECOND
Alpha       1      0     1      0     0      0
Bravo       0      1     0      0     0      0
Charlie     0      0     0      0     1      0
Delta       0      0     0      1     0      1
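To make the n x k and k x m bookkeeping concrete, here is a small sketch that checks the shapes and the shared labels before multiplying. The names indicators, left_shape, right_shape and inner_match are illustrative, not part of the original answer, and dtype=int is an assumption added so the indicators stay 0/1 (newer pandas versions return booleans by default).

```python
import pandas as pd

df1 = pd.DataFrame(
    dict(FIRST=['Alpha', 'Alpha', 'Charlie'],
         SECOND=['Bravo', 'Delta', 'Delta']),
    index=['H1', 'H2', 'H3'])
df2 = pd.DataFrame(
    dict(Alpha=['A', 'C'], Bravo=['A', 'C'],
         Charlie=['T', 'G'], Delta=['T', 'C']),
    index=['item-000', 'item-111'])

# Assumption: dtype=int keeps the indicator values as 0/1.
indicators = pd.get_dummies(df1.stack(), dtype=int).T

left_shape = df2.shape           # n x k
right_shape = indicators.shape   # k x m
# The columns of the left frame and the index of the right frame
# must carry the same labels for the product to be defined.
inner_match = set(df2.columns) == set(indicators.index)
print(left_shape, right_shape, inner_match)
```

Here n is 2 (the items), k is 4 (Alpha to Delta) and m is 6 (one column per H label and FIRST/SECOND slot).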
Right on! Now I'm ready to dot:
df2.dot(pd.get_dummies(df1.stack()).T)
            H1           H2           H3
         FIRST SECOND FIRST SECOND FIRST SECOND
item-000     A      A     A      T     T      T
item-111     C      C     C      C     G      C
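It may not be obvious why a dot product works on letters at all. As far as I understand it, each cell is an ordinary sum of products, and Python strings happen to support both operations: a string times 1 is itself, a string times 0 is the empty string, and adding strings concatenates them. The plain-Python sketch below recomputes a single cell by hand; the variable names are illustrative only.

```python
# One cell of the product: row 'item-000' of df2 against column ('H2', 'SECOND').
row = ['A', 'A', 'T', 'T']   # values for Alpha, Bravo, Charlie, Delta
indicator = [0, 0, 0, 1]     # ('H2', 'SECOND') selects Delta

# 'A' * 0 is '' and 'T' * 1 is 'T'
terms = [value * flag for value, flag in zip(row, indicator)]

# Adding the terms concatenates them, so only the selected letter survives.
cell = ''.join(terms)
print(terms, cell)
```

Because every indicator column contains exactly one 1, each cell ends up holding exactly one letter.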
We are almost there. I concatenate FIRST and SECOND by using pd.DataFrame.sum, where I specify that I want to sum along axis 1 (across the columns of each row), grouped by the first level of the columns object. Summing strings concatenates them.
df2.dot(pd.get_dummies(df1.stack()).T).sum(1, level=0)
          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC
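The level argument of sum was removed in pandas 2.0, so the one-liner above only runs on older versions. A possible equivalent for current pandas is sketched below: transpose, group the rows by the first index level, sum, and transpose back. This is my adaptation rather than the original answer's code. The dtype=object and dtype=int arguments are assumptions added to keep plain Python strings and 0/1 indicators across versions.

```python
import pandas as pd

df1 = pd.DataFrame(
    dict(FIRST=['Alpha', 'Alpha', 'Charlie'],
         SECOND=['Bravo', 'Delta', 'Delta']),
    index=['H1', 'H2', 'H3'])
# Assumption: dtype=object keeps the letters as plain Python strings.
df2 = pd.DataFrame(
    dict(Alpha=['A', 'C'], Bravo=['A', 'C'],
         Charlie=['T', 'G'], Delta=['T', 'C']),
    index=['item-000', 'item-111'], dtype=object)

# Assumption: dtype=int gives 0/1 indicators instead of booleans.
indicators = pd.get_dummies(df1.stack(), dtype=int).T
product = df2.dot(indicators)

# Group the (H label, FIRST/SECOND) columns by their first level
# and add the strings within each group.
result = product.T.groupby(level=0).sum().T
print(result)
```

If this behaves as expected, result should match the table above, with columns H1, H2 and H3.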
Setup
import pandas as pd

df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), ['H1', 'H2', 'H3'])
df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Charlie=['T', 'G'],
    Delta=['T', 'C']
), ['item-000', 'item-111'])
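As a sanity check on the expected result, the same table can be built without any matrix algebra by adding the two referenced columns for each row of df1. This is not the approach taken in the answer, only a straightforward cross-check; the names pairs and check are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    dict(FIRST=['Alpha', 'Alpha', 'Charlie'],
         SECOND=['Bravo', 'Delta', 'Delta']),
    index=['H1', 'H2', 'H3'])
df2 = pd.DataFrame(
    dict(Alpha=['A', 'C'], Bravo=['A', 'C'],
         Charlie=['T', 'G'], Delta=['T', 'C']),
    index=['item-000', 'item-111'])

# For each reference row, concatenate the two named source columns.
pairs = {
    label: df2[row['FIRST']] + df2[row['SECOND']]
    for label, row in df1.iterrows()
}
check = pd.DataFrame(pairs)
print(check)
```

The loop is easier to read, while the dot-product version avoids iterating over df1 in Python.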