Formatting inconsistent date data with pandas

Question

I'm wondering how to approach the problem of inconsistent date formats with pandas. Initially I used a regular expression to extract a date from a large data set of URLs. That worked well; however, the extracted dates do not share a consistent format:

dates
20140609
20140624
20140404
3/18/14
3/10/14
3/14/2014
20140807
20140806
2014-07-18

As you can see, the dates in this dataset are formatted inconsistently. Is there a way to fix this so that all the dates are in the same format?

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 122270 entries, 0 to 122269
Data columns (total 4 columns):
id                  119534 non-null float64
x1                  122270 non-null int64
url                 122270 non-null object
date                122025 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 4.7+ MB


Recommended answer

Use to_datetime; it seems robust enough to handle your inconsistent formatting:

In [77]:

df['dates'] = pd.to_datetime(df['dates'])
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 1 columns):
dates    9 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 144.0 bytes
In [78]:

df
Out[78]:
       dates
0 2014-06-09
1 2014-06-24
2 2014-04-04
3 2014-03-18
4 2014-03-10
5 2014-03-14
6 2014-08-07
7 2014-08-06
8 2014-07-18

For your sample dataset to_datetime works fine. If it doesn't work for you, it will be because some of your values are in formats it can't convert. You can either pass errors='coerce' (spelled coerce=True in older pandas versions), which sets any value that cannot be converted to NaT, or errors='raise' to report any problems.
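If automatic inference does not cope with your data (newer pandas releases appear to be stricter about mixed formats in a single column), one option is to try each expected format explicitly with errors='coerce' and keep the first successful parse per row. The following is a minimal sketch, not the answerer's method: the list of formats, the sample values, and the assumption that slash-separated dates are month-first are all illustrative guesses based on the sample in the question.

```python
import pandas as pd

# Sample values modelled on the question, plus two unparseable entries.
raw = pd.Series(
    ["20140609", "3/18/14", "3/14/2014", "2014-07-18", "not a date", None]
)

# Assumed formats, tried in order; slash dates are assumed to be month/day/year.
formats = ["%Y%m%d", "%m/%d/%y", "%m/%d/%Y", "%Y-%m-%d"]

# Parse with the first format; values that do not match become NaT.
parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")

# Fill the remaining NaT slots with the result of each later format.
for fmt in formats[1:]:
    attempt = pd.to_datetime(raw, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

# Optional: render every parsed date as a uniform YYYY-MM-DD string.
uniform = parsed.dt.strftime("%Y-%m-%d")

print(parsed)
print(uniform)
```

Rows that match none of the listed formats stay as NaT, so parsed.isna() can be used to find the values that still need attention.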
