Pandas - automatically detect date columns **at run time**
Question
I was wondering if pandas is capable of automatically detecting which columns are datetime objects and read those columns in as dates instead of strings?
I am looking at the api and related stack overflow posts but I can't seem to figure it out.
This is a black-box system that takes in arbitrary csv schemas in production, so I do not know what the column names will be.
This seems like it would work but you have to know which columns are date fields:
import pandas as pd
#creating the test data
df = pd.DataFrame({'0': ['a', 'b', 'c'], '1': ['2015-12-27','2015-12-28', '2015-12-29'], '2': [11,12,13]})
df.to_csv('test.csv', index=False)
#loading the test data
df = pd.read_csv('test.csv', parse_dates=True)
print(df.dtypes)
# prints (object, object, int64) instead of (object, datetime, int64)
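For context, `parse_dates=True` only attempts to parse the index column. When the date columns are known in advance, passing their names (or positions) explicitly does work; a small self-contained sketch using the same test data as above:

```python
# parse_dates accepts a list of column names: read_csv then parses
# those columns directly, no post-processing needed.
import pandas as pd

pd.DataFrame({'0': ['a', 'b', 'c'],
              '1': ['2015-12-27', '2015-12-28', '2015-12-29'],
              '2': [11, 12, 13]}).to_csv('test.csv', index=False)

df = pd.read_csv('test.csv', parse_dates=['1'])
print(df.dtypes)  # column '1' is now datetime64[ns]
```

The question, of course, is precisely that these names are not known ahead of time.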
I am thinking if it cannot do this, then I can write something that:
- Finds columns with string type.
- Grab a few unique values and try to parse them.
- If successful then try to parse the whole column.
Edit. I wrote a simple method convertDateColumns that will do this:
import pandas as pd
from dateutil import parser

def convertDateColumns(df):
    object_cols = df.columns.values[df.dtypes.values == 'object']
    date_cols = [c for c in object_cols if testIfColumnIsDate(df[c], num_tries=3)]
    for col in date_cols:
        try:
            df[col] = pd.to_datetime(df[col], errors='coerce')
        except ValueError:
            pass
    return df

def testIfColumnIsDate(series, num_tries=4):
    """ Test if a column contains date values.

    This can try a few times for the scenario where a date column may have
    a couple of null or missing values but we still want to parse when
    possible (and convert those null/missing values to NaT)
    """
    if series.dtype != 'object':
        return False
    # Collect up to num_tries + 1 unique sample values.
    vals = set()
    for val in series:
        vals.add(val)
        if len(vals) > num_tries:
            break
    for val in vals:
        # Skip non-string values (ints, NaN, ...) that dateutil cannot parse.
        if not isinstance(val, str):
            continue
        try:
            parser.parse(val)
            return True
        except (ValueError, OverflowError):
            pass
    return False
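A pandas-only variant of the same sample-then-parse idea is also possible, dropping the dateutil dependency. The function name and sample size below are illustrative, not part of the question:

```python
# Illustrative sketch: sample a few unique values from an object column
# and treat it as a date column only if every sampled value parses cleanly.
import pandas as pd

def looks_like_dates(series, sample_size=4):
    """Heuristic date check via pd.to_datetime on a small sample."""
    if series.dtype != 'object':
        return False
    sample = series.dropna().unique()[:sample_size]
    if len(sample) == 0:
        return False
    parsed = pd.to_datetime(pd.Series(sample), errors='coerce')
    return bool(parsed.notna().all())

df = pd.DataFrame({'a': ['x', 'y', 'z'],
                   'b': ['2015-12-27', '2015-12-28', '2015-12-29']})
print([c for c in df.columns if looks_like_dates(df[c])])  # ['b']
```

Requiring the whole sample to parse (rather than any one value) makes the check stricter than a single dateutil success.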
I would use pd.to_datetime, and catch exceptions on columns that don't work. For example:
import pandas as pd

df = pd.read_csv('test.csv')
for col in df.columns:
    if df[col].dtype == 'object':
        try:
            df[col] = pd.to_datetime(df[col])
        except (ValueError, TypeError):
            pass

df.dtypes
# (object, datetime64[ns], int64)
I believe this is as close to "automatic" as you can get for this application.
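One caveat with the blanket try/except approach: a mostly non-date column where every value happens to be parseable can still be converted. A hedged variant coerces instead and only keeps the conversion when most values parse; the 0.9 threshold here is a judgment call, not a pandas rule:

```python
# Convert an object column only when the vast majority of its values
# parse as dates; unparseable values become NaT via errors='coerce'.
import pandas as pd

df = pd.DataFrame({'0': ['a', 'b', 'c'],
                   '1': ['2015-12-27', '2015-12-28', '2015-12-29'],
                   '2': [11, 12, 13]})

for col in df.select_dtypes(include='object').columns:
    converted = pd.to_datetime(df[col], errors='coerce')
    if converted.notna().mean() > 0.9:  # illustrative threshold
        df[col] = converted

print(df.dtypes)  # 0: object, 1: datetime64[ns], 2: int64
```

This also tolerates a few nulls or typos in an otherwise clean date column, which a plain `pd.to_datetime` call would reject with an exception.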