在 pandas 中加载通用的Google电子表格 [英] Loading a generic Google Spreadsheet in Pandas
问题描述
当我尝试在熊猫中加载Google Spreadsheet
When I try to load a Google Spreadsheet in pandas
from StringIO import StringIO
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=<some_long_code>&output=csv')
data = r.content
df = pd.read_csv(StringIO(data), index_col=0)
我得到以下信息:
CParserError: Error tokenizing data. C error: Expected 1316 fields in line 73, saw 1386
为什么?我认为可以识别出包含数据的电子表格行和列,并将电子表格的行和列分别用作数据框的索引和列(NaN表示空白).为什么会失败?
Why? I would think that one could identify the spreadsheet set of rows and columns with data and use the spreadsheets rows and columns as the dataframe index and columns respectively (with NaN for anything empty). Why does it fail?
推荐答案
As one of the commentators noted you have not asked for the data in CSV format you have the "edit" request at the end of the url You can use this code and see it work on the spreadsheet (which by the way needs to be public..) It is possible to do private sheets as well but that is another topic.
from StringIO import StringIO # got moved around in python3 if you're using that.
import requests
r = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak1ecr7i0wotdGJmTURJRnZLYlV3M2daNTRubTdwTXc&output=csv')
data = r.content
In [10]: df = pd.read_csv(StringIO(data), index_col=0,parse_dates=['Quradate'])
In [11]: df.head()
Out[11]:
City region Res_Comm \
0 Dothan South_Central-Montgomery-Auburn-Wiregrass-Dothan Residential
10 Foley South_Mobile-Baldwin Residential
12 Birmingham North_Central-Birmingham-Tuscaloosa-Anniston Commercial
38 Brent North_Central-Birmingham-Tuscaloosa-Anniston Residential
44 Athens North_Huntsville-Decatur-Florence Residential
mkt_type Quradate National_exp Alabama_exp Sales_exp \
0 Rural 2010-01-15 00:00:00 2 2 3
10 Suburban_Urban 2010-01-15 00:00:00 4 4 4
12 Suburban_Urban 2010-01-15 00:00:00 2 2 3
38 Rural 2010-01-15 00:00:00 3 3 3
44 Suburban_Urban 2010-01-15 00:00:00 4 5 4
用于获取csv输出的新Google电子表格url格式为
The new Google spreadsheet url format for getting the csv output is
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&id
好吧,现在您需要再次更改网址格式:
Well they changed the url format slightly again now you need:
https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=0 #for the 1st sheet
我还发现我需要执行以下操作来对Python 3进行以上修改:
I also found I needed to do the following to deal with Python 3 a slight revision to the above:
from io import StringIO
并获取文件:
guid=0 #for the 1st sheet
act = requests.get('https://docs.google.com/spreadsheets/d/177_dFZ0i-duGxLiyg6tnwNDKruAYE-_Dd8vAQziipJQ/export?format=csv&gid=%s' % guid)
dataact = act.content.decode('utf-8') #To convert to string for Stringio
actdf = pd.read_csv(StringIO(dataact),index_col=0,parse_dates=[0], thousands=',').sort()
actdf现在是带有标题(列名)的完整的熊猫数据框
actdf is now a full pandas dataframe with headers (column names)
这篇关于在 pandas 中加载通用的Google电子表格的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!