Pandas.read_excel KeyError读取一组xlsx文件 [英] Pandas.read_excel KeyError on reading set of xlsx files
问题描述
我正在使用Anaconda shell进行数据分析升级到熊猫一堆Excel文件(25个文件)在此文件上 https://www.dropbox.com/s/16ea1cw6k63i16p/Newdata.zip?dl=0 我出错了.无法找到原因以及如何修复它.
I'm using Anaconda shell for data analitycs Uploding to pandas bunch of excel files (25 files) On This files https://www.dropbox.com/s/16ea1cw6k63i16p/Newdata.zip?dl=0 I get error. Cant find the reason why and how to fix it.
import pandas as pd
import numpy as np
import os
os.chdir(r"C:\Users\Twentyouts\Desktop\Newdata" )
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']
for f in files_xlsx:
print(f)
loading = pd.read_excel(f, heading = 0)
df = df.append(loading)
2016-06-20–2016-06-26.xlsx
2016-06-27–2016-07-03.xlsx
2016-07-04–2016-07-10.xlsx
2016-07-11–2016-07-17.xlsx
2016-08-01–2016-08-07.xlsx
2016-08-15–2016-08-21.xlsx
KeyError Traceback (most recent call last)
<ipython-input-23-5737d4d13b9f> in <module>()
1 df = pd.DataFrame()
----> 2 pd.read_excel('2016-08-15–2016-08-21.xlsx')
C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, names, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, true_values, false_values, engine, squeeze, **kwds)
189
190 if not isinstance(io, ExcelFile):
--> 191 io = ExcelFile(io, engine=engine)
192
193 return io._parse_excel(
C:\Users\Twentyouts\Anaconda3\lib\site-packages\pandas\io\excel.py in __init__(self, io, **kwds)
247 self.book = xlrd.open_workbook(file_contents=data)
248 elif isinstance(io, compat.string_types):
--> 249 self.book = xlrd.open_workbook(io)
250 else:
251 raise ValueError('Must explicitly set engine if not passing in'
C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\__init__.py in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
420 formatting_info=formatting_info,
421 on_demand=on_demand,
--> 422 ragged_rows=ragged_rows,
423 )
424 return bk
C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in open_workbook_2007_xml(zf, component_names, logfile, verbosity, use_mmap, formatting_info, on_demand, ragged_rows)
831 x12sheet = X12Sheet(sheet, logfile, verbosity)
832 heading = "Sheet %r (sheetx=%d) from %r" % (sheet.name, sheetx, fname)
--> 833 x12sheet.process_stream(zflo, heading)
834 del zflo
835
C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in own_process_stream(self, stream, heading)
546 for event, elem in ET.iterparse(stream):
547 if elem.tag == row_tag:
--> 548 self_do_row(elem)
549 elem.clear() # destroy all child elements (cells)
550 elif elem.tag == U_SSML12 + "dimension":
C:\Users\Twentyouts\Anaconda3\lib\site-packages\xlrd\xlsx.py in do_row(self, row_elem)
743 else:
744 bad_child_tag(child_tag)
--> 745 value = error_code_from_text[tvalue]
746 self.sheet.put_cell(rowx, colx, XL_CELL_ERROR, value, xf_index)
747 elif cell_type == "inlineStr":
KeyError: None
推荐答案
实际上,@ MaxU指出,Excel文件格式错误,但有趣的是,如果正确保存为.xlsx文件,它也可以解析.可能是通过简单地将扩展名更改为.xlsx来尝试从无效的.xls版本升级无效文件.这两种文件格式不是简单的文本文件,可以安全地更改扩展名,但是二进制格式却大不相同.
Indeed as @MaxU points out, Excel files are malformed but interestingly does resolve when properly saved as an .xlsx file. Possibly, the invalid files were attempted to be upgraded from previous .xls version by simply changing the extension to .xlsx. These two file formats are not simple text files that can change extension without hazard but very different binary formats.
考虑使用 wn32com
模块运行COM接口,以使用Excel的
Consider running a COM interface using wn32com
module to properly save the malformed files to actual OpenXML workbooks using Excel's Workbook.SaveAs method. Note: this solution is only compliant in Python for Windows with installed MS Excel.
import pandas as pd
import glob
import win32com.client as win32
xlsxfiles = glob.glob("C:\\Path\\To\\Workbooks\\*.xlsx")
def save_xlsx(srcfile):
try:
newfile = srcfile.replace('.xlsx', '_new.xlsx')
print('Malformed file saved as {}'.format(newfile))
xlApp = win32.gencache.EnsureDispatch('Excel.Application')
wb = xlApp.Workbooks.Open(srcfile)
wb.SaveAs(newfile, 51)
except Exception as e:
print(e)
finally:
wb.Close(True); wb = None
xlApp.Quit; xlApp = None
return newfile
def xl_read():
dfs = []
for f in xlsxfiles:
try:
df = pd.read_excel(f)
except Exception as e:
df = pd.read_excel(save_xlsx(f))
print('File: {}, Shape: {}'.format(f, df.shape))
dfs.append(df)
return pd.concat(dfs)
print('Final dataframe shape: {}'.format(xl_read().shape))
输出 (330,257行和30列的最终数据帧)
File: C:\Path\To\Workbooks\2016-06-20–2016-06-26.xlsx, Shape: (5912, 27)
File: C:\Path\To\Workbooks\2016-06-27–2016-07-03.xlsx, Shape: (5362, 27)
File: C:\Path\To\Workbooks\2016-07-04–2016-07-10.xlsx, Shape: (5387, 27)
File: C:\Path\To\Workbooks\2016-07-11–2016-07-17.xlsx, Shape: (5331, 28)
File: C:\Path\To\Workbooks\2016-08-01–2016-08-07.xlsx, Shape: (4965, 28)
Malformed file saved as C:\Path\To\Workbooks\2016-08-15–2016-08-21_new.xlsx
File: C:\Path\To\Workbooks\2016-08-15–2016-08-21.xlsx, Shape: (5315, 27)
File: C:\Path\To\Workbooks\2016-08-22–2016-08-28.xlsx, Shape: (5179, 27)
File: C:\Path\To\Workbooks\2016-08-29–2016-09-04.xlsx, Shape: (5855, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-05–2016-09-11_new.xlsx
File: C:\Path\To\Workbooks\2016-09-05–2016-09-11.xlsx, Shape: (5838, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-12–2016-09-18_new.xlsx
File: C:\Path\To\Workbooks\2016-09-12–2016-09-18.xlsx, Shape: (5729, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-09-19–2016-09-25_new.xlsx
File: C:\Path\To\Workbooks\2016-09-19–2016-09-25.xlsx, Shape: (6401, 27)
File: C:\Path\To\Workbooks\2016-09-26–2016-10-02.xlsx, Shape: (7018, 27)
File: C:\Path\To\Workbooks\2016-09.xlsx, Shape: (23874, 27)
File: C:\Path\To\Workbooks\2016-10-03–2016-10-09.xlsx, Shape: (6587, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-12_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-12.xlsx, Shape: (2883, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-10–2016-10-13_new.xlsx
File: C:\Path\To\Workbooks\2016-10-10–2016-10-13.xlsx, Shape: (4174, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-20_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-20.xlsx, Shape: (4560, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-17–2016-10-23_new.xlsx
File: C:\Path\To\Workbooks\2016-10-17–2016-10-23.xlsx, Shape: (7111, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-27_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-27.xlsx, Shape: (4921, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-24–2016-10-30_new.xlsx
File: C:\Path\To\Workbooks\2016-10-24–2016-10-30.xlsx, Shape: (8005, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10-31–2016-11-06_new.xlsx
File: C:\Path\To\Workbooks\2016-10-31–2016-11-06.xlsx, Shape: (7029, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-10_new.xlsx
File: C:\Path\To\Workbooks\2016-10.xlsx, Shape: (28098, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-07–2016-11-13_new.xlsx
File: C:\Path\To\Workbooks\2016-11-07–2016-11-13.xlsx, Shape: (7076, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-14–2016-11-20_new.xlsx
File: C:\Path\To\Workbooks\2016-11-14–2016-11-20.xlsx, Shape: (7758, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21.xlsx, Shape: (1689, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-21–2016-11-23_new.xlsx
File: C:\Path\To\Workbooks\2016-11-21–2016-11-23.xlsx, Shape: (4711, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11-28–2016-12-04_new.xlsx
File: C:\Path\To\Workbooks\2016-11-28–2016-12-04.xlsx, Shape: (9286, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-11_new.xlsx
File: C:\Path\To\Workbooks\2016-11.xlsx, Shape: (30505, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-05–2016-12-11_new.xlsx
File: C:\Path\To\Workbooks\2016-12-05–2016-12-11.xlsx, Shape: (8802, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-12–2016-12-18_new.xlsx
File: C:\Path\To\Workbooks\2016-12-12–2016-12-18.xlsx, Shape: (8333, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-16–2016-12-22_new.xlsx
File: C:\Path\To\Workbooks\2016-12-16–2016-12-22.xlsx, Shape: (8592, 27)
Malformed file saved as C:\Path\To\Workbooks\2016-12-26–2016-12-31_new.xlsx
File: C:\Path\To\Workbooks\2016-12-26–2016-12-31.xlsx, Shape: (5362, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-01–2017-01-08_new.xlsx
File: C:\Path\To\Workbooks\2017-01-01–2017-01-08.xlsx, Shape: (4322, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-09–2017-01-15_new.xlsx
File: C:\Path\To\Workbooks\2017-01-09–2017-01-15.xlsx, Shape: (7608, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-23–2017-01-29_new.xlsx
File: C:\Path\To\Workbooks\2017-01-23–2017-01-29.xlsx, Shape: (8903, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-01-30–2017-02-05_new.xlsx
File: C:\Path\To\Workbooks\2017-01-30–2017-02-05.xlsx, Shape: (9173, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-12_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-12.xlsx, Shape: (9144, 27)
Malformed file saved as C:\Path\To\Workbooks\2017-02-13–2017-02-19_new.xlsx
File: C:\Path\To\Workbooks\2017-02-13–2017-02-19.xlsx, Shape: (9911, 27)
File: C:\Path\To\Workbooks\test.xlsx, Shape: (5315, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 12-15.12_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 12-15.12.xlsx, Shape: (4818, 27)
Malformed file saved as C:\Path\To\Workbooks\Выгрузка 21-27_new.xlsx
File: C:\Path\To\Workbooks\Выгрузка 21-27.xlsx, Shape: (8876, 27)
File: C:\Path\To\Workbooks\Выгрузка 26-29.12.xlsx, Shape: (4539, 27)
Final dataframe shape: (330257, 30)
甚至考虑通过 pyodbc
使用Windows的ACE Engine的数据库引擎方法来查询熊猫
Consider even a database engine approach using Windows' ACE Engine via pyodbc
to query corresponding workbooks with pandas read_sql
since each share same sheet name, TDSheet.
#...same as above
import pyodbc
def sql_read():
dfs = []
for f in xlsxfiles:
try:
conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\
'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(f), autocommit=True)
df = pd.read_sql('SELECT * FROM [TDSheet$];', conn)
except Exception as e:
conn.close()
conn = pyodbc.connect('Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};'+\
'DBQ=C:\\Path\\To\\Workbooks\\{};'.format(save_xlsx(f)), autocommit=True)
df = pd.read_excel('SELECT * FROM [TDSheet$];', conn)
conn.close()
print('File: {}, Shape: {}'.format(f, df.shape))
dfs.append(df)
这篇关于Pandas.read_excel KeyError读取一组xlsx文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!