Organizing data read from Excel to Pandas DataFrame


Problem description

My goal with this script is to:

1. Read time series data from an Excel file (>100,000k rows), along with the headers (labels, units).
2. Convert the Excel numeric dates to the best datetime object for a pandas DataFrame.
3. Be able to use timestamps to reference rows and series labels to reference columns.

So far I have used xlrd to read the Excel data into lists. I made a pandas Series from each list, using the time list as the index, then combined the Series with their headers into a Python dictionary and passed that dictionary to a pandas DataFrame. Despite my efforts, df.index seems to be set to the column headers, and I'm not sure at which point I should convert the dates into datetime objects.

I just started using Python 3 days ago, so any advice would be great! Here's my code:

    import pandas as pd
    import xlrd
    from pandas import Series

    #Open excel workbook and first sheet
    wb = xlrd.open_workbook(r"C:\GreenCSV\Calgary\CWater.xlsx")
    sh = wb.sheet_by_index(0)

    #Read rows containing labels and units
    Labels = sh.row_values(1, start_colx=0, end_colx=None)
    Units = sh.row_values(2, start_colx=0, end_colx=None)

    #Initialize list to hold data
    Data = [None] * (sh.ncols)

    #read column by column and store in list
    for colnum in range(sh.ncols):
        Data[colnum] = sh.col_values(colnum, start_rowx=5, end_rowx=None)

    #Delete unnecessary rows and columns
    del Labels[3], Labels[0:2], Units[3], Units[0:2], Data[3], Data[0:2]

    #Create Pandas Series
    s = [None] * (sh.ncols - 4)
    for colnum in range(sh.ncols - 4):
        s[colnum] = Series(Data[colnum+1], index=Data[0])

    #Create Dictionary of Series
    dictionary = {}
    for i in range(sh.ncols-4):
        dictionary[i]= {Labels[i] : s[i]}

    #Pass Dictionary to Pandas DataFrame
    df = pd.DataFrame.from_dict(dictionary)
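
For comparison, here is a minimal sketch of the dictionary step on made-up data. The lists `times`, `labels` and `columns` are hypothetical stand-ins for what was read from the sheet. In the code above each dictionary value is itself a one-item dictionary (`{Labels[i]: s[i]}`), so pandas treats the outer keys `0, 1, ...` as columns and the labels as the index. Mapping each label directly to its Series should give labels as columns and the shared time list as the index:

```python
import pandas as pd

# Made-up stand-ins for the lists read from the sheet (not the real data).
times = [41275.0, 41275.5, 41276.0]              # Excel serial dates
labels = ["Flow", "Pressure"]                    # hypothetical series labels
columns = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]  # one list per data column

# Map each label straight to its Series; no nested {label: series} dicts.
series_by_label = {
    label: pd.Series(values, index=times)
    for label, values in zip(labels, columns)
}

# Dictionary keys become the column labels, the Series index becomes df.index.
df = pd.DataFrame(series_by_label)
print(df)
```

With this layout, `df["Flow"]` selects a column by its label and `df.loc[41275.5]` selects a row by its (still numeric) time value.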

Recommended answer

You can use pandas directly here. I usually like to create a dictionary of DataFrames (with the sheet names as keys):

    In [11]: xl = pd.ExcelFile(r"C:\GreenCSV\Calgary\CWater.xlsx")

    In [12]: xl.sheet_names  # in your example it may be different
    Out[12]: [u'Sheet1', u'Sheet2', u'Sheet3']

    In [13]: dfs = {sheet: xl.parse(sheet) for sheet in xl.sheet_names}

    In [14]: dfs['Sheet1']  # access DataFrame by sheet name

You can check out the docs on parse, which offers some more options (for example skiprows); these allow you to parse individual sheets with much more control.
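
The question also asks when to convert the Excel numeric dates and how to reference rows by timestamp. One possible approach, sketched here on a small made-up frame (the column names `Time` and `Flow` are assumptions, not taken from the workbook), is to convert the serial-date column with `pd.to_datetime` once the data is in a DataFrame and then make it the index. The origin `1899-12-30` assumes the workbook uses Excel's default 1900 date system:

```python
import pandas as pd

# Hypothetical frame: an Excel serial-date column plus one data column.
df = pd.DataFrame({
    "Time": [41275.0, 41275.5, 41276.0],
    "Flow": [1.0, 2.0, 3.0],
})

# Serial dates are days since 1899-12-30 (Excel's default 1900 date system).
df["Time"] = pd.to_datetime(df["Time"], unit="D", origin="1899-12-30")
df = df.set_index("Time")

# Reference a row by timestamp and a column by its label.
value = df.loc[pd.Timestamp("2013-01-01 12:00"), "Flow"]

# Partial-string indexing selects every row on a given day.
day = df.loc["2013-01-01"]
print(value)
print(day)
```

If the cells are formatted as dates in Excel, pandas may already return them as datetimes when parsing the sheet, in which case only the `set_index` step would be needed.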
