使用Pandas读取Excel XML .xls文件 [英] Read Excel XML .xls file with pandas

查看:205
本文介绍了使用Pandas读取Excel XML .xls文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我知道许多先前提出的问题,但是给出的解决方案都无法在下面提供的可重现示例中起作用.

I'm aware of a number of previously asked questions, but none of the solutions given work on the reproducible example that I provide below.

我正在尝试从 http://www.eia读取.xls文件.gov/coal/data.cfm#production -特别是历史详细煤炭生产数据(1983-2013) coalpublic2012.xls文件,该文件可通过下拉菜单免费获得.熊猫看不懂.

I am trying to read in .xls files from http://www.eia.gov/coal/data.cfm#production -- specifically the Historical detailed coal production data (1983-2013) coalpublic2012.xls file that's freely available via the dropdown. Pandas cannot read it.

相反,最近可用的文件2013 coalpublic2013.xls可以正常工作:

In contrast, the file for the most recent year available, 2013, coalpublic2013.xls file, works without a problem:

import pandas as pd
df1 = pd.read_excel("coalpublic2013.xls")

,但不会加载下一个十年的.xls文件(2004-2012).我已经用Excel查看了这些文件,它们已经打开,并且没有损坏.

but the next decade of .xls files (2004-2012) do not load. I have looked at these files with Excel, and they open, and are not corrupted.

我从熊猫那里得到的错误是:

The error that I get from pandas is:

---------------------------------------------------------------------------
XLRDError                                 Traceback (most recent call last)
<ipython-input-28-0da33766e9d2> in <module>()
----> 1 df = pd.read_excel("coalpublic2012.xlsx")

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in read_excel(io, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, convert_float, has_index_names, converters, engine, **kwds)
    161 
    162     if not isinstance(io, ExcelFile):
--> 163         io = ExcelFile(io, engine=engine)
    164 
    165     return io._parse_excel(

/Users/jonathan/anaconda/lib/python2.7/site-packages/pandas/io/excel.pyc in __init__(self, io, **kwds)
    204                 self.book = xlrd.open_workbook(file_contents=data)
    205             else:
--> 206                 self.book = xlrd.open_workbook(io)
    207         elif engine == 'xlrd' and isinstance(io, xlrd.Book):
    208             self.book = io

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/__init__.pyc in open_workbook(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
    433         formatting_info=formatting_info,
    434         on_demand=on_demand,
--> 435         ragged_rows=ragged_rows,
    436         )
    437     return bk

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in open_workbook_xls(filename, logfile, verbosity, use_mmap, file_contents, encoding_override, formatting_info, on_demand, ragged_rows)
     89         t1 = time.clock()
     90         bk.load_time_stage_1 = t1 - t0
---> 91         biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
     92         if not biff_version:
     93             raise XLRDError("Can't determine file's BIFF version")

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in getbof(self, rqd_stream)
   1228             bof_error('Expected BOF record; met end of file')
   1229         if opcode not in bofcodes:
-> 1230             bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
   1231         length = self.get2bytes()
   1232         if length == MY_EOF:

/Users/jonathan/anaconda/lib/python2.7/site-packages/xlrd/book.pyc in bof_error(msg)
   1222         if DEBUG: print("reqd: 0x%04x" % rqd_stream, file=self.logfile)
   1223         def bof_error(msg):
-> 1224             raise XLRDError('Unsupported format, or corrupt file: ' + msg)
   1225         savpos = self._position
   1226         opcode = self.get2bytes()

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

我还尝试了其他各种方法:

And I have tried various other things:

df = pd.ExcelFile("coalpublic2012.xls", encoding_override='cp1252')
import xlrd
wb = xlrd.open_workbook("coalpublic2012.xls")

无济于事.我的熊猫版本:0.17.0

to no avail. My pandas version: 0.17.0

我也已将此错误作为错误提交给了熊猫github 问题列表.

I've also submitted this as a bug to the pandas github issues list.

推荐答案

您可以通过编程方式转换此Excel XML文件.要求:仅python和pandas.

You can convert this Excel XML file programmatically. Requirement: only python and pandas.

import pandas as pd
from xml.sax import ContentHandler, parse

# Reference https://goo.gl/KaOBG3
class ExcelHandler(ContentHandler):
    def __init__(self):
        self.chars = [  ]
        self.cells = [  ]
        self.rows = [  ]
        self.tables = [  ]
    def characters(self, content):
        self.chars.append(content)
    def startElement(self, name, atts):
        if name=="Cell":
            self.chars = [  ]
        elif name=="Row":
            self.cells=[  ]
        elif name=="Table":
            self.rows = [  ]
    def endElement(self, name):
        if name=="Cell":
            self.cells.append(''.join(self.chars))
        elif name=="Row":
            self.rows.append(self.cells)
        elif name=="Table":
            self.tables.append(self.rows)

excelHandler = ExcelHandler()
parse('coalpublic2012.xls', excelHandler)
df1 = pd.DataFrame(excelHandler.tables[0][4:], columns=excelHandler.tables[0][3])

这篇关于使用Pandas读取Excel XML .xls文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆