Concatenate specific pairs of columns in a dataframe based on a reference dataframe with a different index


Problem description

My goal is to concatenate columns in a dataframe (Source), based on pairs that are described in a separate dataframe (Reference). The resulting dataframe should replace the column headers in the Source with the index labels in the Reference.

The dataframe used as the reference:

import pandas as pd

df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), ['H1', 'H2', 'H3'])

df1

      FIRST SECOND
H1    Alpha  Bravo
H2    Alpha  Delta
H3  Charlie  Delta

The dataframe used as the data source:

df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Delta=['T', 'C'],
    Charlie=['T', 'G']
), ['item-000', 'item-111'])

df2

           Alpha Bravo Charlie Delta
item-000     A     A       T     T
item-111     C     C       G     C

What the resulting dataframe should look like:

          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC


Question
Is this possible with the current set up, or would it be best to change the form the data is in?

Recommended answer

Solution
Using pd.get_dummies and pd.DataFrame.dot

df2.dot(pd.get_dummies(df1.stack()).T).sum(1, level=0)

          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC


Explanation

I know I want to use a dot product. The rule for matrix multiplication is that an n x k matrix multiplied by a k x m matrix results in an n x m matrix. Looking at the final result, I see ['item-000', 'item-111'] in the index; that is the n in my n x k matrix. I look at my preliminary dataframes: do I have one with ['item-000', 'item-111'] in either the columns or the index? I do!

df2

           Alpha Bravo Charlie Delta
item-000     A     A       T     T
item-111     C     C       G     C

and that implies my k is ['Alpha', 'Bravo', 'Charlie', 'Delta']. Ok, so now I must look for the k x m matrix. The only other dataframe I have is df1, and the things that look like ['Alpha', 'Bravo', 'Charlie', 'Delta'] are in the values, not in the columns or index. So I must get them there. I decide to stack df1 and use pd.get_dummies.

pd.get_dummies(df1.stack())

           Alpha  Bravo  Charlie  Delta
H1 FIRST       1      0        0      0
   SECOND      0      1        0      0
H2 FIRST       1      0        0      0
   SECOND      0      0        0      1
H3 FIRST       0      0        1      0
   SECOND      0      0        0      1

And now I have ['Alpha', 'Bravo', 'Charlie', 'Delta'] in my columns! That's my k. But I need it in my index. No problem, use transpose.

pd.get_dummies(df1.stack()).T

           H1           H2           H3       
        FIRST SECOND FIRST SECOND FIRST SECOND
Alpha       1      0     1      0     0      0
Bravo       0      1     0      0     0      0
Charlie     0      0     0      0     1      0
Delta       0      0     0      1     0      1
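Before multiplying, the n x k and k x m claim can be checked directly. This is a minimal sketch that assumes the df1 and df2 from the question; the names `dummies` and `right` are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), ['H1', 'H2', 'H3'])

df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Delta=['T', 'C'],
    Charlie=['T', 'G']
), ['item-000', 'item-111'])

# Indicator matrix: one row per (H, FIRST/SECOND) pair, one column per name.
dummies = pd.get_dummies(df1.stack())
right = dummies.T

n, k = df2.shape       # left operand: items x names
k2, m = right.shape    # right operand: names x (H, FIRST/SECOND) pairs

# The inner dimensions must agree, and the labels must be the same set.
labels_match = set(df2.columns) == set(right.index)
print((n, k), (k2, m), labels_match)  # (2, 4) (4, 6) True
```

The label order does not have to match, since DataFrame.dot aligns the left operand's columns with the right operand's index before multiplying.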

Right on! Now I'm ready to dot.

df2.dot(pd.get_dummies(df1.stack()).T)

            H1           H2           H3       
         FIRST SECOND FIRST SECOND FIRST SECOND
item-000     A      A     A      T     T      T
item-111     C      C     C      C     G      C
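It may not be obvious why a dot product works on strings. A dot product is a sum of element-wise products, and for Python strings multiplication by an integer is repetition while addition is concatenation. The sketch below recomputes a single cell by hand (row 'item-000', column ('H2', 'SECOND')); the variable names are illustrative only.

```python
# Values of row 'item-000' in df2, in the order Alpha, Bravo, Charlie, Delta.
row = ['A', 'A', 'T', 'T']

# Indicator column ('H2', 'SECOND') of the transposed dummies: only Delta is set.
indicator = [0, 0, 0, 1]

# 'X' * 1 keeps the letter, 'X' * 0 gives the empty string.
terms = [value * weight for value, weight in zip(row, indicator)]

# Adding the terms together is plain string concatenation.
cell = ''.join(terms)

print(terms)  # ['', '', '', 'T']
print(cell)   # T
```

Because each indicator column contains exactly one 1, every cell of the product picks out exactly one letter.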

We are almost there. I concatenate FIRST and SECOND by using pd.DataFrame.sum, where I specify that I want to sum along axis 1 (across the columns of each row), grouped by the first level of the columns object.

df2.dot(pd.get_dummies(df1.stack()).T).sum(1, level=0)

          H1  H2  H3
item-000  AA  AT  TT
item-111  CC  CC  GC
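The level argument of DataFrame.sum was deprecated in pandas 1.3 and removed in 2.0, so the last step raises an error on recent versions. One possible equivalent, sketched here on the assumption of a recent pandas, groups the transposed frame by its first index level instead. The `dtype=int` and `astype(object)` calls are defensive additions that are not part of the original answer, and the names `dummies`, `pairs` and `result` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), ['H1', 'H2', 'H3'])

df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Delta=['T', 'C'],
    Charlie=['T', 'G']
), ['item-000', 'item-111'])

# Integer indicators, so every product is 'X' * 1 or 'X' * 0.
dummies = pd.get_dummies(df1.stack(), dtype=int)

# Object dtype keeps the multiplication on plain Python strings.
pairs = df2.astype(object).dot(dummies.T)

# Transpose so the (H, FIRST/SECOND) labels are on the index, group by H,
# concatenate the strings within each group, then transpose back.
result = pairs.T.groupby(level=0).sum().T
print(result)
```

The double transpose is used because grouping along the columns (axis=1) is itself deprecated in newer releases.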


Setup

df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), ['H1', 'H2', 'H3'])

df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Delta=['T', 'C'],
    Charlie=['T', 'G']
), ['item-000', 'item-111'])
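For comparison, the same table can also be built without the dot product, by looking up each pair of columns explicitly. This is not the method used in the answer above, only a plain alternative sketch that assumes the reference frame always has exactly the two columns FIRST and SECOND; the name `result` is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(dict(
    FIRST=['Alpha', 'Alpha', 'Charlie'],
    SECOND=['Bravo', 'Delta', 'Delta']
), ['H1', 'H2', 'H3'])

df2 = pd.DataFrame(dict(
    Alpha=['A', 'C'],
    Bravo=['A', 'C'],
    Delta=['T', 'C'],
    Charlie=['T', 'G']
), ['item-000', 'item-111'])

# One output column per reference row: fetch the two named source columns
# and concatenate them element-wise.
result = pd.DataFrame({
    label: df2[first] + df2[second]
    for label, first, second in zip(df1.index, df1['FIRST'], df1['SECOND'])
})
print(result)
```

This version reads more directly but loops in Python over the reference rows, so it is likely to be the slower choice when the reference frame is large.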
