使用python pandas解析CSV,格式为Year,Day,hour,Min,Sec [英] Using python pandas to parse CSV with date in format Year, Day, Hour, Min, Sec

查看:221
本文介绍了使用python pandas解析CSV,格式为Year,Day,hour,Min,Sec的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有几个CSV文件,格式为:

I have several CSV files with the format:

Year,Day,Hour,Min,Sec.,P1'S1
 2003,  1, 0, 0,12.22, 0.541
 2003,  1, 1, 0,20.69, 0.708
 2003,  1, 2, 0, 4.95, 0.520
 2003,  1, 3, 0,13.42, 0.539
...

(其中 day ,是一年中的一天),我正在尝试使用 pandas 库(似乎是一个到目前为止还不错).

(where day, is the day of the year) and I'm trying to read them using the pandas library (seems a fantastic lib so far).

有一个内置函数可以读取大熊猫中的CSV,甚至更好的是,该函数应该检查日期类型的列.并自动将其用作索引(这对我的工作而言将是完美的选择).

There is a built-in function to read CSV in pandas, and even better, that function supposedly checks the columns for a date type. and automatically uses that as an index (which would be exactly perfect for what I'm doing).

问题是,我无法使用这种格式的日期数据.

The thing is, I cannot get it to work with date data in this format.

我尝试过:

data = pd.read_csv("csvFile.csv", index_col=[0, 1],  , index_col=[0, 1, 2, 3, 4] parse_dates=True)

只能正确获取年份:

In [36]: data.index
Out[36]: 
MultiIndex
[(<Timestamp: 2003-09-04 00:00:00>, 1, 0, 0, 12.22)
 (<Timestamp: 2003-09-04 00:00:00>, 1, 1, 0, 20.69)
 (<Timestamp: 2003-09-04 00:00:00>, 1, 2, 0, 4.95) ...,
 (<Timestamp: 2003-09-04 00:00:00>, 365, 21, 0, 3.77)
 (<Timestamp: 2003-09-04 00:00:00>, 365, 22, 0, 14.6)
 (<Timestamp: 2003-09-04 00:00:00>, 365, 23, 0, 13.36)]

从文档中,我看到您可以在熊猫的read_csv函数中指定"date_parser"属性.但是文档没有显示如何操作,我无法弄清楚. 任何在该主题上有经验的人都可以伸出援手.

From the documentation, I see that you can specify the "date_parser" attribute in the read_csv function of pandas. But the documentation doesn't show how and I'm not being able to figure it out. Anyone with experience in the subject that can give a hand.

干杯, 布鲁诺

推荐答案

为了解析多列日期,您需要告诉pandas哪些列应合并为一个日期,因此您需要说parse_dates=['Year','Day','Hour','Min','Sec']

In order to parse a multi-column date, you need to tell pandas which columns should be combined into a single date, so you need to say parse_dates=['Year','Day','Hour','Min','Sec']

您还需要定义自己的解析器,该解析器从parse_dates中指定的每一列中获取一个元素:

You also need to define your own parser that takes a element from each column you specified in parse_dates:

In [1]: import pandas as pd

In [2]: from datetime import datetime, timedelta

In [3]: from cStringIO import StringIO

In [4]: data = """\
Year,Day,Hour,Min,Sec.,P1'S1
 2003,  1, 0, 0,12.22, 0.541
 2003,  1, 1, 0,20.69, 0.708
 2003,  1, 2, 0, 4.95, 0.520
 2003,  1, 3, 0,13.42, 0.539
"""

In [5]: def parse(yr, doy, hr, min, sec):
    yr, doy, hr, min = [int(x) for x in [yr, doy, hr, min]]
    sec = float(sec)
    mu_sec = int((sec - int(sec)) * 1e6)
    sec = int(sec)
    dt = datetime(yr - 1, 12, 31)
    delta = timedelta(days=doy, hours=hr, minutes=min, seconds=sec,
                      microseconds=mu_sec)
    return dt + delta
   ...: 

In [6]: pd.read_csv(StringIO(data), parse_dates={'datetime':      
           ['Year','Day','Hour','Min','Sec.']}, 
           date_parser=parse, index_col='datetime')
Out[6]: 
                            P1'S1
datetime                         
2003-01-01 00:00:12.220000  0.541
2003-01-01 01:00:20.690000  0.708
2003-01-01 02:00:04.950000  0.520
2003-01-01 03:00:13.419999  0.539

这篇关于使用python pandas解析CSV,格式为Year,Day,hour,Min,Sec的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆