Concatenate dataframes with different column ordering

Question

I am parsing data from Excel files, and the columns of the resulting DataFrame may or may not align with those of a base DataFrame onto which I want to stack several parsed DataFrames.

Let's call the DataFrame I parse from the data A, and the base DataFrame df_A.

I read an Excel sheet, resulting in A =

Index                    AGUB  AGUG   MUEB   MUEB    SIL    SIL   SILB   SILB
2012-01-01 00:00:00      0.00     0   0.00  50.78   0.00   0.00   0.00   0.00
2012-01-01 01:00:00      0.00     0   0.00  53.15   0.00  53.15   0.00   0.00
2012-01-01 02:00:00      0.00     0   0.00   0.00  53.15  53.15  53.15  53.15
2012-01-01 03:00:00      0.00     0   0.00   0.00   0.00  55.16   0.00   0.00
2012-01-01 04:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 05:00:00     48.96     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 06:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 07:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 08:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 09:00:00     52.28     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 10:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 11:00:00     36.93     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 12:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 13:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00  50.00
2012-01-01 14:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00  34.01
2012-01-01 15:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 16:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 17:00:00     53.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 18:00:00      0.00    75   0.00  75.00   0.00  75.00   0.00   0.00
2012-01-01 19:00:00      0.00    70   0.00  70.00   0.00   0.00   0.00   0.00
2012-01-01 20:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 21:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 22:00:00      0.00     0   0.00   0.00   0.00   0.00   0.00   0.00
2012-01-01 23:00:00      0.00     0  53.45  53.45   0.00   0.00   0.00   0.00

I create the base DataFrame:

units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']
df_A = pd.DataFrame(columns=units)
df_A = pd.concat([df_A, A], axis=0)

Usually with concat, if A had fewer columns than df_A it would be fine, but in this case the only difference in the columns is the order. The concatenation leads to the following error:

ValueError: Plan shapes are not aligned

I'd like to know how to concatenate the two DataFrames with the column order given by df_A.

Recommended answer

I've tried this, and it doesn't matter whether there are more columns in the source DataFrame or in the target DataFrame. Either way, the result is a DataFrame that consists of the union of all supplied columns (columns that are specified in the target but not populated by the source are filled with NaN).

Where I have been able to reproduce your error is when the column names in either the source or the target DataFrame include a duplicate name (or empty column names).

In your example, several columns appear more than once in your source file. I don't think concat copes very well with this kind of duplicate column.
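A quick way to check for this is to ask the column index whether it is unique. The following is a minimal sketch, assuming a one-row sample frame built by hand with the same repeated headers as the question (the sample values and the names dup_mask and dup_names are illustrative, not part of the original post):

```python
import pandas as pd

# Hypothetical one-row sample that reuses the repeated headers from the question
A = pd.DataFrame(
    [[0.0, 0, 0.0, 50.78, 0.0, 0.0, 0.0, 0.0]],
    columns=['AGUB', 'AGUG', 'MUEB', 'MUEB', 'SIL', 'SIL', 'SILB', 'SILB'],
)

# Index.duplicated() flags every repeat of a name after its first occurrence
dup_mask = A.columns.duplicated()
dup_names = A.columns[dup_mask].unique().tolist()

print(A.columns.is_unique)  # False
print(dup_names)            # ['MUEB', 'SIL', 'SILB']
```

If is_unique is False, the repeated names need to be resolved (dropped, renamed, or combined) before concatenating.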

import pandas as pd
s1 = [0,1,2,3,4,5]
s2 = [0,0,0,0,1,1]
A = pd.DataFrame([s2,s1],columns=['A','B','C','D','E','F'])

Result:


A B C D E F
-----------
0 0 0 0 1 1 
0 1 2 3 4 5 

Take a subset of the columns and use them to create a new DataFrame called B:

B = A[['A','C','E']]

 

A C E
-----
0 0 1 
0 2 4 

Create a new, empty target DataFrame:

col_names = ['D','A','C','B']
Z = pd.DataFrame(columns=col_names)


D A C B
-------

And concatenate the two:

Z = pd.concat([B,Z],axis=0)


A  C  D   E
0  0  NaN 1 
0  2  NaN 4 

Looks good!
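The question also asked for the column order of the target. concat returns the union of the columns, and the order of that union is not something I would rely on, so one option is to impose the order afterwards with reindex. A minimal sketch, reusing A, B and col_names from above; the names extra and ordered, and the choice to append source-only columns at the end, are assumptions made for illustration:

```python
import pandas as pd

A = pd.DataFrame([[0, 0, 0, 0, 1, 1], [0, 1, 2, 3, 4, 5]],
                 columns=['A', 'B', 'C', 'D', 'E', 'F'])
B = A[['A', 'C', 'E']]

col_names = ['D', 'A', 'C', 'B']
Z = pd.DataFrame(columns=col_names)

Z = pd.concat([B, Z], axis=0)

# Target columns first, then any columns that exist only in the source
extra = [c for c in Z.columns if c not in col_names]
ordered = Z.reindex(columns=col_names + extra)

print(ordered.columns.tolist())  # ['D', 'A', 'C', 'B', 'E']
```

Passing only col_names to reindex would instead drop E, which may be what you want if the target defines the full schema.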

But if I recreate the empty DataFrame with a duplicated column name, like so:

col_names = ['D','A','C','D']
Z = pd.DataFrame(columns=col_names)


    D A C D

And try to concatenate:

Z = pd.concat([B,Z],axis=0)

Then I get the error you describe.
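One possible way around it is to make the column names unique first and then align each parsed frame to the base column order before stacking. The sketch below is only an illustration under stated assumptions: align_to_base is a hypothetical helper, A1 and A2 are hand-made samples, and keeping the first of each repeated column is an arbitrary choice that discards the later copies (combining them, for example by summing, may suit the real data better).

```python
import pandas as pd

units = ['MUE', 'MUEB', 'SIL', 'SILB', 'AGUG', 'AGUB', 'MUEBP', 'MUELP']


def align_to_base(frame, base_columns):
    """Hypothetical helper: drop repeated columns, then order columns like the base."""
    # Assumption: keep only the first occurrence of each repeated column name
    deduped = frame.loc[:, ~frame.columns.duplicated()]
    # reindex orders the columns like the base, fills missing ones with NaN,
    # and drops any column that is not listed in base_columns
    return deduped.reindex(columns=base_columns)


# Hypothetical parsed sheets: A1 has repeated headers, A2 has a different order
A1 = pd.DataFrame(
    [[0.0, 0.0, 0.0, 50.78, 0.0, 0.0, 0.0, 0.0],
     [48.96, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]],
    columns=['AGUB', 'AGUG', 'MUEB', 'MUEB', 'SIL', 'SIL', 'SILB', 'SILB'],
    index=pd.to_datetime(['2012-01-01 00:00:00', '2012-01-01 05:00:00']),
)
A2 = pd.DataFrame(
    [[2.0, 1.0, 3.0]],
    columns=['SIL', 'AGUB', 'MUE'],
    index=pd.to_datetime(['2012-01-02 00:00:00']),
)

stacked = pd.concat([align_to_base(A1, units), align_to_base(A2, units)], axis=0)
print(stacked.columns.tolist())
```

Because every frame is aligned to the same column list before concat, the stacked result follows the order given by units regardless of how each sheet was laid out.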
