Pandas.read_excel KeyError读取一组xlsx文件 [英] Pandas.read_excel KeyError on reading set of xlsx files

查看:146
本文介绍了Pandas.read_excel KeyError读取一组xlsx文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用Anaconda shell进行数据分析升级到熊猫一堆Excel文件(25个文件)在此文件上 https://www.dropbox.com/s/16ea1cw6k63i16p/Newdata.zip?dl=0 我出错了.无法找到原因以及如何修复它.

I'm using Anaconda shell for data analitycs Uploding to pandas bunch of excel files (25 files) On This files https://www.dropbox.com/s/16ea1cw6k63i16p/Newdata.zip?dl=0 I get error. Cant find the reason why and how to fix it.

import pandas as pd
import numpy as np
import os

os.chdir(r"C:\Users\Twentyouts\Desktop\Newdata" )
path = os.getcwd()

files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

for f in files_xlsx:
    print(f)
    loading = pd.read_excel(f, heading = 0)
    df = df.append(loading)
2016-06-20–2016-06-26.xlsx
2016-06-27–2016-07-03.xlsx
2016-07-04–2016-07-10.xlsx
2016-07-11–2016-07-17.xlsx
2016-08-01–2016-08-07.xlsx
2016-08-15–2016-08-21.xlsx


KeyError                                  Traceback (most recent call last)
<ipython-input-23-5737d4d13b9f> in <module>()
      1 df = pd.DataFrame()
----> 2 pd.read_excel('2016-08-15–2016-08-21.xlsx')

C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, true_values, false_values, engine, squeeze, **kwds)
    189 
    190     if not isinstance(io, ExcelFile):
--> 191         io = ExcelFile(io, engine=engine)
    192 
    193     return io._parse_excel(

C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in __init__(self, io, **kwds)
    247             self.book = xlrd.open_workbook(file_contents=data)
    248         elif isinstance(io, compat.string_types):
--> 249             self.book = xlrd.open_workbook(io)
    250         else:
    251             raise ValueError('Must explicitly set engine if not passing in'

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    420                 formatting_info=formatting_info,
    421                 on_demand=on_demand,
--> 422                 ragged_rows=ragged_rows,
    423                 )
    424             return bk

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
    831         x12sheet = X12Sheet(sheet, logfile, verbosity)
    832         heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833         x12sheet.process_stream(zflo, heading)
    834         del zflo
    835 

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in own_process_stream(self, stream, heading)
    546         for event, elem in ET.iterparse(stream):
    547             if elem.tag == row_tag:
--> 548                 self_do_row(elem)
    549                 elem.clear() # destroy all child elements (cells)
    550             elif elem.tag == U_SSML12 + "dimension":

C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in do_row(self, row_elem)
    743                     else:
    744                         bad_child_tag(child_tag)
--> 745                 value = error_code_from_text[tvalue]
    746                 self.sheet.put_cell(rowx, colx, XL_CELL_ERROR, value, xf_index)
    747             elif cell_type == "inlineStr":

KeyError: None

推荐答案

实际上,@ MaxU指出,Excel文件格式错误,但有趣的是,如果正确保存为.xlsx文件,它也可以解析.可能是通过简单地将扩展名更改为.xlsx来尝试从无效的.xls版本升级无效文件.这两种文件格式不是简单的文本文件,可以安全地更改扩展名,但是二进制格式却大不相同.

Indeed as @MaxU points out, Excel files are malformed but interestingly does resolve when properly saved as an .xlsx file. Possibly, the invalid files were attempted to be upgraded from previous .xls version by simply changing the extension to .xlsx. These two file formats are not simple text files that can change extension without hazard but very different binary formats.

考虑使用 wn32com 模块运行COM接口,以使用Excel的

Consider running a COM interface using wn32com module to properly save the malformed files to actual OpenXML workbooks using Excel's Workbook.SaveAs method. Note: this solution is only compliant in Python for Windows with installed MS Excel.

import pandas as pd
import glob
import win32com.client as win32

xlsxfiles = glob.glob("C:\\Path\\To\\Workbooks\\*.xlsx")

def save_xlsx(srcfile):
    try:
        newfile = srcfile.replace('.xlsx', '_new.xlsx')
        print('Malformed file saved as {}'.format(newfile))
        xlApp = win32.gencache.EnsureDispatch('Excel.Application')
        wb = xlApp.Workbooks.Open(srcfile)
        wb.SaveAs(newfile, 51)                 

    except Exception as e:
        print(e)            
    finally:
        wb.Close(True); wb = None
        xlApp.Quit; xlApp = None    
    return newfile

def xl_read():    
    dfs = []
    for f in xlsxfiles:        
        try:
            df = pd.read_excel(f)
        except Exception as e:            
            df = pd.read_excel(save_xlsx(f))

        print('File: {}, Shape: {}'.format(f, df.shape))
        dfs.append(df)            
    return pd.concat(dfs)

print('Final dataframe shape: {}'.format(xl_read().shape))  

输出 (330,257行和30列的最终数据帧)

File: C:\Path\To\Workbooks\2016-06-20–2016-06-26.xlsx, Shape: (5912, 27)
File: C:\Path\To\Workbooks\2016-06-27–2016-07-03.xlsx, Shape: (5362, 27)
File: C:\Path\To\Workbooks\2016-07-04–2016-07-10.xlsx, Shape: (5387, 27)
File: C:\Path\To\Workbooks\2016-07-11–2016-07-17.xlsx, Shape: (5331, 28)
File: C:\Path\To\Workbooks\2016-08-01–2016-08-07.xlsx, Shape: (4965, 28)
Malformed file saved as C:\Path\To\Workbooks\2016-08-15–2016-08-21_new.xlsx
File: C:\Path\To\Workbooks\2016-08-15–2016-08-21.xlsx, Shape: (5315, 27)
File: C:\Path\To\Workbooks\2016-08-22–2016-08-28.xlsx, Shape: (5179, 27)
File: C:\Path\To\Workbooks\2016-08-29–2016-09-04.xlsx, Shape: (5855, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-05–2016-09-11_new.xlsx
File: C:\Path\To\Workbooks\2016-09-05–2016-09-11.xlsx, Shape: (5838, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-12–2016-09-18_new.xlsx
File: C:\Path\To\Workbooks\2016-09-12–2016-09-18.xlsx, Shape: (5729, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-19–2016-09-25_new.xlsx
File: C:\Path\To\Workbooks\2016-09-19–2016-09-25.xlsx, Shape: (6401, 27)
File: C:\Path\To\Workbooks\2016-09-26–2016-10-02.xlsx, Shape: (7018, 27)
File: C:\Path\To\Workbooks\2016-09.xlsx, Shape: (23874, 27)
File: C:\Path\To\Workbooks\2016-10-03–2016-10-09.xlsx, Shape: (6587, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-12_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-12.xlsx, Shape: (2883, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-13_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-13.xlsx, Shape: (4174, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-20_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-20.xlsx, Shape: (4560, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-23_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-23.xlsx, Shape: (7111, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-27_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-27.xlsx, Shape: (4921, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-30_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-30.xlsx, Shape: (8005, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-31–2016-11-06_new.xlsx
File: C:\Path\To\Workbooks\2016-10-31–2016-11-06.xlsx, Shape: (7029, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10_new.xlsx
File: C:\Path\To\Workbooks\2016-10.xlsx, Shape: (28098, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-07–2016-11-13_new.xlsx
File: C:\Path\To\Workbooks\2016-11-07–2016-11-13.xlsx, Shape: (7076, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-14–2016-11-20_new.xlsx
File: C:\Path\To\Workbooks\2016-11-14–2016-11-20.xlsx, Shape: (7758, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21.xlsx, Shape: (1689, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21–2016-11-23_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21–2016-11-23.xlsx, Shape: (4711, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-28–2016-12-04_new.xlsx
File: C:\Path\To\Workbooks\2016-11-28–2016-12-04.xlsx, Shape: (9286, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11_new.xlsx
File: C:\Path\To\Workbooks\2016-11.xlsx, Shape: (30505, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-05–2016-12-11_new.xlsx
File: C:\Path\To\Workbooks\2016-12-05–2016-12-11.xlsx, Shape: (8802, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-12–2016-12-18_new.xlsx
File: C:\Path\To\Workbooks\2016-12-12–2016-12-18.xlsx, Shape: (8333, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-16–2016-12-22_new.xlsx
File: C:\Path\To\Workbooks\2016-12-16–2016-12-22.xlsx, Shape: (8592, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-26–2016-12-31_new.xlsx
File: C:\Path\To\Workbooks\2016-12-26–2016-12-31.xlsx, Shape: (5362, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-01–2017-01-08_new.xlsx
File: C:\Path\To\Workbooks\2017-01-01–2017-01-08.xlsx, Shape: (4322, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-09–2017-01-15_new.xlsx
File: C:\Path\To\Workbooks\2017-01-09–2017-01-15.xlsx, Shape: (7608, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-23–2017-01-29_new.xlsx
File: C:\Path\To\Workbooks\2017-01-23–2017-01-29.xlsx, Shape: (8903, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-30–2017-02-05_new.xlsx
File: C:\Path\To\Workbooks\2017-01-30–2017-02-05.xlsx, Shape: (9173, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-12_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-12.xlsx, Shape: (9144, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-19_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-19.xlsx, Shape: (9911, 27)
File: C:\Path\To\Workbooks\test.xlsx, Shape: (5315, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 12-15.12_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 12-15.12.xlsx, Shape: (4818, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 21-27_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 21-27.xlsx, Shape: (8876, 27)
File: C:\Path\To\Workbooks\Выгрузка 26-29.12.xlsx, Shape: (4539, 27)
Final dataframe shape: (330257, 30)


甚至考虑通过 pyodbc 使用Windows的ACE Engine的数据库引擎方法来查询熊猫


Consider even a database engine approach using Windows' ACE Engine via pyodbc to query corresponding workbooks with pandas read_sql since each share same sheet name, TDSheet.

#...same as above
import pyodbc

def sql_read():    
    dfs = [] 
    for f in xlsxfiles:                
        try:
            conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\
                      'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(f), autocommit=True)
            df =  pd.read_sql('SELECT * FROM [TDSheet$];', conn)

        except Exception as e:
            conn.close()
            conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\
                      'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(save_xlsx(f)), autocommit=True)
            df = pd.read_excel('SELECT * FROM [TDSheet$];', conn)
            conn.close()

        print('File: {}, Shape: {}'.format(f, df.shape))
        dfs.append(df)

这篇关于Pandas.read_excel KeyError读取一组xlsx文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆