Read an Excel sheet with multiple header rows using pandas


Problem description

I have an Excel sheet with multiple header rows, like this:

_________________________________________________________________________
____|_____|        Header1    |        Header2     |        Header3      |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColDK|
1   | ds  | 5  | 6  |9   |10  | .......................................
2   | dh  |  ..........................................................
3   | ge  |  ..........................................................
4   | ew  |  ..........................................................
5   | er  |  ..........................................................

As you can see, the first two columns have no top-level header (those cells are blank), while the other columns are grouped under headers such as Header1, Header2 and Header3. I want to read this sheet and merge it with other sheets that have a similar structure.

I want to merge them on the first column, 'ColX'. Right now I am doing this:

import pandas as pd

totalMergedSheet = pd.DataFrame([1,2,3,4,5], columns=['ColX'])
file = pd.ExcelFile('ExcelFile.xlsx')
for i in range (1, len(file.sheet_names)):
    df1 = file.parse(file.sheet_names[i-1])
    df2 = file.parse(file.sheet_names[i])
    newMergedSheet = pd.merge(df1, df2, on='ColX')
    totalMergedSheet = pd.merge(totalMergedSheet, newMergedSheet, on='ColX')

But I don't think this reads the columns correctly, and I don't think it will return the result in the form I want. The resulting frame should look like this:

________________________________________________________________________________________________________
____|_____|        Header1    |        Header2     |        Header3      |        Header4     |        Header5      |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColK| ColL|ColM|ColN|ColO||ColP|ColQ|ColR|ColS|
1   | ds  | 5  | 6  |9   |10  | ..................................................................................
2   | dh  |  ...................................................................................
3   | ge  |  ....................................................................................
4   | ew  |  ...................................................................................
5   | er  |  ......................................................................................

Any suggestions, please? Thanks.

Solution

Pandas already has a function that will read in an entire Excel spreadsheet for you, so you don't need to manually parse/merge each sheet. Take a look at pandas.read_excel(). It not only lets you read in an Excel file in a single line, but also provides options that help solve the problem you're having.

Since you have subcolumns, what you're looking for is a MultiIndex on the columns. By default, pandas reads the top row as the sole header row. You can pass a header argument to pandas.read_excel() that lists which rows are to be used as headers. In your particular case, you'd want header=[0, 1], indicating the first two rows. You might also have multiple sheets, so you can pass sheet_name=None as well (this tells it to read all sheets). Older pandas releases spelled this parameter sheetname; current releases use sheet_name. The command would be:

import pandas

# header=[0, 1] uses the first two rows as a two-level column header;
# sheet_name=None reads every sheet in the workbook.
df_dict = pandas.read_excel('ExcelFile.xlsx', header=[0, 1], sheet_name=None)
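To make the two-level header more concrete, here is a minimal sketch of how such a frame can be addressed once it is loaded. It builds a small stand-in frame in memory instead of reading a file, so the values are made up, and the "Unnamed: ..." labels are an assumption about how pandas tends to name blank top-level header cells; check `df.columns` on your own data.

```python
import pandas as pd

# In-memory stand-in for one sheet as read_excel(..., header=[0, 1]) might return it.
# The "Unnamed: ..." labels and all cell values are assumptions for illustration.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "ColX"),
    ("Unnamed: 1_level_0", "ColY"),
    ("Header1", "ColA"),
    ("Header1", "ColB"),
    ("Header2", "ColE"),
])
sheet = pd.DataFrame(
    [[1, "ds", 5, 6, 7], [2, "dh", 8, 9, 10]],
    columns=columns,
)

# Selecting a top-level label returns all of its subcolumns as a DataFrame.
header1 = sheet["Header1"]

# A (top, sub) tuple selects exactly one column as a Series.
col_a = sheet[("Header1", "ColA")]

print(header1)
print(col_a)
```

Selecting by the top-level label is what makes the grouped layout convenient: each header group can be pulled out as its own small frame.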

This returns a dictionary where the keys are the sheet names and the values are the DataFrames for each sheet. If you want to collapse it all into one DataFrame, you can use pandas.concat; with axis=0 the sheets are stacked on top of each other, row-wise:

df = pandas.concat(df_dict.values(), axis=0)
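If the sheets share the same ColX/ColY rows and only contribute different header groups, as in the desired output in the question, a side-by-side layout may be closer to what is wanted than row-wise stacking. The following is a sketch under that assumption, not part of the original answer: it moves the first two columns into the index and concatenates along axis=1. The helper `make_sheet`, the sheet names, and the values are hypothetical stand-ins for what `read_excel` would return.

```python
import pandas as pd


def make_sheet(header, subcols, rows):
    """Hypothetical stand-in for one sheet read with header=[0, 1]."""
    columns = pd.MultiIndex.from_tuples(
        [("Unnamed: 0_level_0", "ColX"), ("Unnamed: 1_level_0", "ColY")]
        + [(header, name) for name in subcols]
    )
    return pd.DataFrame(rows, columns=columns)


# Stand-in for the dictionary returned by read_excel(..., sheet_name=None).
df_dict = {
    "Sheet1": make_sheet("Header1", ["ColA", "ColB"], [[1, "ds", 5, 6], [2, "dh", 7, 8]]),
    "Sheet2": make_sheet("Header4", ["ColL", "ColM"], [[1, "ds", 50, 60], [2, "dh", 70, 80]]),
}

indexed = []
for df in df_dict.values():
    # Use the first two columns (ColX, ColY) as the row index, whatever
    # label pandas gave their blank top-level header cells.
    key_cols = list(df.columns[:2])
    indexed.append(df.set_index(key_cols))

# axis=1 places the header groups next to each other, aligned on the index.
wide = pd.concat(indexed, axis=1)
wide.index.names = ["ColX", "ColY"]

print(wide)
```

Because concat aligns on the index, rows that appear in only one sheet are kept and filled with NaN for the other sheet's columns.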
