Merging list of DFs with alternating columns output using Pandas

Question

I have the following code:

import pandas as pd

rep1 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'z']), ('Gene', ['foo', 'bar', 'qux']), ('RP1',[1.00,23.22,11.12]),('RP1',["A","B","C"])   ], orient='columns')
rep2 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'z']), ('Gene', ['foo', 'bar', 'qux']), ('RP2',[3.33,77.22,18.12]),('RP2',["G","I","K"])   ], orient='columns')
rep3 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'k']), ('Gene', ['foo', 'bar', 'kux']), ('RP3',[99.99,98.29,8.10]),('RP2',["M","P","J"]) ], orient='columns')

tmp = []
tmp.append(rep1)
tmp.append(rep2)
tmp.append(rep3)

This produces the following list of data frames:

In [56]: tmp
Out[56]:
 [  Probe Gene    RP1 RP1
 0     x  foo   1.00   A
 1     y  bar  23.22   B
 2     z  qux  11.12   C,   Probe Gene    RP2 RP2
 0     x  foo   3.33   G
 1     y  bar  77.22   I
 2     z  qux  18.12   K,   Probe Gene    RP3 RP2
 0     x  foo  99.99   M
 1     y  bar  98.29   P
 2     k  kux   8.10   J]
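
A note for readers on newer pandas: `DataFrame.from_items` was deprecated in pandas 0.23 and removed in 1.0, so the setup above only runs on old versions. A rough equivalent, assuming a recent pandas, is to pass row tuples together with an explicit `columns` list, which permits the duplicated labels (a plain dict could not hold them). The helper name `make_rep` is hypothetical and only serves this sketch.

```python
import pandas as pd


def make_rep(probes, genes, values, letters, value_name, letter_name):
    """Build one replicate frame; the column labels may repeat."""
    # zip the column-wise lists into row tuples, then label the columns explicitly
    rows = list(zip(probes, genes, values, letters))
    return pd.DataFrame(rows, columns=['Probe', 'Gene', value_name, letter_name])


rep1 = make_rep(['x', 'y', 'z'], ['foo', 'bar', 'qux'],
                [1.00, 23.22, 11.12], ['A', 'B', 'C'], 'RP1', 'RP1')
rep2 = make_rep(['x', 'y', 'z'], ['foo', 'bar', 'qux'],
                [3.33, 77.22, 18.12], ['G', 'I', 'K'], 'RP2', 'RP2')
# The last label is 'RP2' here, exactly as in the question's sample data.
rep3 = make_rep(['x', 'y', 'k'], ['foo', 'bar', 'kux'],
                [99.99, 98.29, 8.10], ['M', 'P', 'J'], 'RP3', 'RP2')

tmp = [rep1, rep2, rep3]
```

The resulting `tmp` holds the same three frames as the original snippet.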

Each of the data frames above has the following characteristics:

  1. It always contains 4 columns.
  2. The last two columns have identical names.
  3. The first two columns are always named Probe and Gene.
  4. The contents of Probe and Gene always appear consistently, i.e. 'x' always goes with 'foo'.

I'm trying to merge the DFs in the list so that it produces this:

  Probe Gene    RP1    RP2    RP3  RP1  RP2  RP3
0     x  foo   1.00   3.33  99.99    A    G    M
1     y  bar  23.22  77.22  98.29    B    I    P
2     z  qux  11.12  18.12     NA    C    K   NA
3     k  kux     NA     NA   8.10   NA   NA    J

I tried this code, but it failed:

In [67]: reduce(pd.merge,tmp)
MergeError: Left data columns not unique: Index([u'Probe', u'Gene', u'RP1', u'RP1'], dtype='object')

What's the right way to do it?

Recommended answer

You could dedupe the column names. Here's a kind of hacky way:

In [11]: list(rep1.columns[0:2]) + [rep1.columns[2] + "_value"] + [rep1.columns[2] + "_letter"]
Out[11]: ['Probe', 'Gene', 'RP1_value', 'RP1_letter']

In [12]: for rep in tmp:
   .....:     rep.columns = list(rep.columns[0:2]) + [rep.columns[2] + "_value"] + [rep.columns[2] + "_letter"]

In [13]: reduce(pd.merge,tmp)
Out[13]:
  Probe Gene  RP1_value RP1_letter  RP2_value RP2_letter  RP3_value RP3_letter
0     x  foo       1.00          A       3.33          G      99.99          M
1     y  bar      23.22          B      77.22          I      98.29          P


You also need to specify it as an outer merge (to get the NaN rows):

In [21]: reduce(lambda x, y: pd.merge(x, y, how='outer'),tmp)
Out[21]:
  Probe Gene  RP1_value RP1_letter  RP2_value RP2_letter  RP3_value RP3_letter
0     x  foo       1.00          A       3.33          G      99.99          M
1     y  bar      23.22          B      77.22          I      98.29          P
2     z  qux      11.12          C      18.12          K        NaN        NaN
3     k  kux        NaN        NaN        NaN        NaN       8.10          J
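
For readers on Python 3 and a recent pandas, the rename and the outer merge can be put together in one self-contained sketch. It is not the answerer's exact code: `reduce` has to be imported from `functools`, the sample frames are rebuilt without `from_items`, the join keys are passed explicitly through `on=`, and the helper name `dedupe_columns` is hypothetical. The final step is an addition that reorders the columns into the layout requested in the question (all values first, then all letters).

```python
from functools import reduce  # reduce is not a builtin in Python 3

import pandas as pd

# Sample frames from the question, rebuilt row-wise with explicit
# (duplicated) column labels.
rep1 = pd.DataFrame(
    [('x', 'foo', 1.00, 'A'), ('y', 'bar', 23.22, 'B'), ('z', 'qux', 11.12, 'C')],
    columns=['Probe', 'Gene', 'RP1', 'RP1'])
rep2 = pd.DataFrame(
    [('x', 'foo', 3.33, 'G'), ('y', 'bar', 77.22, 'I'), ('z', 'qux', 18.12, 'K')],
    columns=['Probe', 'Gene', 'RP2', 'RP2'])
rep3 = pd.DataFrame(
    [('x', 'foo', 99.99, 'M'), ('y', 'bar', 98.29, 'P'), ('k', 'kux', 8.10, 'J')],
    columns=['Probe', 'Gene', 'RP3', 'RP2'])
tmp = [rep1, rep2, rep3]


def dedupe_columns(df):
    """Return a copy whose last two columns are <name>_value and <name>_letter."""
    name = df.columns[2]  # the third column's label identifies the replicate
    out = df.copy()       # leave the caller's frame untouched
    out.columns = list(df.columns[:2]) + [name + '_value', name + '_letter']
    return out


renamed = [dedupe_columns(df) for df in tmp]

# An outer merge keeps probes that are missing from some replicates (filled with NaN).
merged = reduce(
    lambda left, right: pd.merge(left, right, on=['Probe', 'Gene'], how='outer'),
    renamed)

# Reorder: key columns, then every *_value column, then every *_letter column.
value_cols = [c for c in merged.columns if c.endswith('_value')]
letter_cols = [c for c in merged.columns if c.endswith('_letter')]
merged = merged[['Probe', 'Gene'] + value_cols + letter_cols]

print(merged)
```

The row order of an outer merge may differ between pandas versions, so the rows are best looked up by `Probe` rather than by position.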
