Using Pandas to read in an Excel file from a URL - XLRDError

Problem description

I am trying to read Excel files into Pandas from the following URLs:

url1 = 'https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls'

url2 = 'https://cib.societegenerale.com/fileadmin/indices_feeds/STTI_Historical.xls'

using the code:

pd.read_excel(url1)

However, it doesn't work and I get the error:

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '2000/01/'

After searching on Google, it seems that .xls files offered through URLs are sometimes actually stored in a different format behind the scenes, such as HTML or XML.

When I manually download the file and open it in Excel, I get this error message: "The file format and extension don't match. The file could be corrupted or unsafe. Unless you trust its source, don't open it."

When I do open it, it looks just like a normal Excel file.

I came across a post online suggesting that I open the file in a text editor to see whether it holds any additional information about its real file format, but I don't see any such information when I open it in Notepad++.
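The text-editor check described above can also be done programmatically by looking at the first bytes of the file. The following is only a minimal sketch: `sniff_spreadsheet_format` is a hypothetical helper (not part of pandas or xlrd), and the sample bytes are made up to resemble the `'2000/01/'` prefix reported in the error message.

```python
# Leading bytes ("magic numbers") of common spreadsheet containers.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx is a ZIP archive


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the real format of a file from its first bytes (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    if head.lstrip().startswith(b"<"):
        return "html-or-xml"
    if b"\t" in head:
        return "tab-separated-text"
    return "unknown"


# Made-up sample: only the leading date prefix mirrors the error message.
sample = b"2000/01/03\t1000.0\t0.0\n"
print(sniff_spreadsheet_format(sample))
```

For a downloaded file, the same function could be given something like `open(path, "rb").read(64)`. A result other than `xls` or `xlsx` would suggest that `read_excel` is the wrong reader for the file, whatever its extension says.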

Could someone please help me read this "xls" file into a pandas DataFrame properly?

Recommended answer

It seems you can use read_csv:

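import pandas as pd

# Despite the .xls extension, the file appears to be tab-separated text,
# so it is parsed with read_csv rather than read_excel.
df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls',
                 sep='\t',
                 parse_dates=[0],  # parse the first column as dates
                 names=['a','b','c','d','e','f'])  # placeholder column names
print(df)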

Then I check whether the last column f contains any values other than NaN:

print(df[df.f.notnull()])

Empty DataFrame
Columns: [a, b, c, d, e, f]
Index: []

The result is empty, so column f contains only NaN values and you can leave it out with the usecols parameter:

import pandas as pd

df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls',
                 sep='\t',
                 parse_dates=[0],
                 names=['a','b','c','d','e','f'],
                 usecols=['a','b','c','d','e'])  # skip the all-NaN column f
print(df)
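As a possible alternative to listing the columns in usecols, an all-NaN column can be dropped after reading with `DataFrame.dropna(axis=1, how='all')`. The sketch below runs on made-up in-memory data, not the real file, and assumes the empty last column comes from a trailing tab on each row, which is one plausible cause but is not confirmed for the files above.

```python
import io

import pandas as pd

# Made-up sample: tab-separated rows whose trailing tab
# produces an empty sixth column.
sample = (
    "2000/01/03\t1.0\t2.0\t3.0\t4.0\t\n"
    "2000/01/04\t1.5\t2.5\t3.5\t4.5\t\n"
)

df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, parse_dates=[0])

# Drop every column that contains only NaN, whatever its position.
df = df.dropna(axis=1, how="all")

print(df.shape)
```

This avoids hard-coding how many columns the file has, at the cost of also dropping any other column that happens to be completely empty.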
