pandas :合并两个具有不同名称的列? [英] pandas: Merge two columns with different names?

查看:241
本文介绍了 pandas :合并两个具有不同名称的列?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将上方和下方的两个数据帧连接起来.不能并排连接.

I am trying to concatenate two dataframes, above and below. Not concatenate side-by-side.

数据帧包含相同的数据,但是,在第一个数据帧中,一列的名称可能为"ObjectType",而在第二个数据帧中,此列的名称可能为"ObjectClass".

The dataframes contain the same data, however, in the first dataframe one column might have name "ObjectType" and in the second dataframe the column might have name "ObjectClass". When I do

df_total = pandas.concat ([df0, df1])

df_total将具有两个列名称,一个列名为"ObjectType",另一个列为"ObjectClass".在这两列的每一列中,一半的值将是"NaN".因此,我必须手动将这两列合并为一个列.

the df_total will have two column names, one with "ObjectType" and another with "ObjectClass". In each of these two columns, half of the values will be "NaN". So I have to manually merge these two columns into one which is a pain.

我可以以某种方式将两列合并为一个吗?我希望有一个功能可以执行以下操作:

Can I somehow merge the two columns into one? I would like to have a function that does something like:

df_total = pandas.merge_many_columns(input=["ObjectType,"ObjectClass"], output=["MyObjectClasses"]

它将两个列合并,并创建一个新列.我已经研究了melt(),但实际上并没有做到这一点吗?

which merges the two columns and creates a new column. I have looked into melt() but it does not really do this?

(也许我可以指定发生冲突时会发生什么,比如说两列包含值,那会很好,在这种情况下,我提供了一个lambda函数,该函数表示保持最大值",使用平均值等)

(Maybe it would be nice if I could specify what will happen if there is a collision, say that two columns contain values, in that case I supply a lambda function that says "keep the largest value", "use an average", etc)

推荐答案

我认为您可以先重命名列以对齐两个DataFrame中的数据:

I think you can rename column first for align data in both DataFrames:

df0 = pd.DataFrame({'ObjectType':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df0)

df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df1)

inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"

#dict comprehension 
d = {x:output for x in inputs}
print (d)
{'ObjectType': 'MyObjectClasses', 'ObjectClass': 'MyObjectClasses'}

df0 = df0.rename(columns=d)
df1 = df1.rename(columns=d)
df_total = pd.concat([df0, df1], ignore_index=True)
print (df_total)
   B  C  MyObjectClasses
0  4  7                1
1  5  8                2
2  6  9                3
3  4  7                1
4  5  8                2
5  6  9                3

更简单的是 update (正在运行inplace):

df = pd.concat([df0, df1])
df['ObjectType'].update(df['ObjectClass'])
print (df)
   B  C  ObjectClass  ObjectType
0  4  7          NaN         1.0
1  5  8          NaN         2.0
2  6  9          NaN         3.0
0  4  7          1.0         1.0
1  5  8          2.0         2.0
2  6  9          3.0         3.0

fillna ,但随后需要删除原始列中的列:

Or fillna, but then need drop original columns columns:

df = pd.concat([df0, df1])
df["ObjectType"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop('ObjectClass', axis=1)
print (df)
   B  C  ObjectType
0  4  7         1.0
1  5  8         2.0
2  6  9         3.0
0  4  7         1.0
1  5  8         2.0
2  6  9         3.0


df = pd.concat([df0, df1])
df["MyObjectClasses"] = df['ObjectType'].fillna(df['ObjectClass'])
df = df.drop(['ObjectType','ObjectClass'], axis=1)
print (df)
   B  C  MyObjectClasses
0  4  7              1.0
1  5  8              2.0
2  6  9              3.0
0  4  7              1.0
1  5  8              2.0
2  6  9              3.0

时间:

df0 = pd.DataFrame({'ObjectType':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df0)

df1 = pd.DataFrame({'ObjectClass':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})

#print (df1)
df0 = pd.concat([df0]*1000).reset_index(drop=True)
df1 = pd.concat([df1]*1000).reset_index(drop=True)

inputs= ["ObjectType","ObjectClass"]
output= "MyObjectClasses"

#dict comprehension 
d = {x:output for x in inputs}


In [241]: %timeit df_total = pd.concat([df0.rename(columns=d), df1.rename(columns=d)], ignore_index=True)
1000 loops, best of 3: 821 µs per loop

In [240]: %%timeit
     ...: df = pd.concat([df0, df1])
     ...: df['ObjectType'].update(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 

100 loops, best of 3: 2.18 ms per loop

In [242]: %%timeit
     ...: df = pd.concat([df0, df1])
     ...: df['MyObjectClasses'] = df['ObjectType'].combine_first(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 
100 loops, best of 3: 2.21 ms per loop

In [243]: %%timeit 
     ...: df = pd.concat([df0, df1])
     ...: df['MyObjectClasses'] = df['ObjectType'].fillna(df['ObjectClass'])
     ...: df = df.drop(['ObjectType','ObjectClass'], axis=1)
     ...: 
100 loops, best of 3: 2.28 ms per loop

这篇关于 pandas :合并两个具有不同名称的列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆