Regression with Date variable using Scikit-learn

Question

I have a Pandas DataFrame with a date column (e.g. 2013-04-01) of dtype datetime.date. When I include that column in X_train and try to fit the regression model, I get the error float() argument must be a string or a number. Removing the date column avoids this error.

What is the proper way to take the date into account in the regression model?

Code

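import pandas.io.sql as sql  # assumption: read_frame is the legacy pandas SQL helper
from sklearn.ensemble import RandomForestRegressor

data = sql.read_frame(...)
# Features: every column except the target 'y' (this still includes the date column)
X_train = data.drop('y', axis=1)
y_train = data.y

rf = RandomForestRegressor().fit(X_train, y_train)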

Error

TypeError                                 Traceback (most recent call last)
<ipython-input-35-8bf6fc450402> in <module>()
----> 2 rf = RandomForestRegressor().fit(X_train, y_train)

C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    292                 X.ndim != 2 or
    293                 not X.flags.fortran):
--> 294             X = array2d(X, dtype=DTYPE, order="F")
    295 
    296         n_samples, self.n_features_ = X.shape

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in array2d(X, dtype, order, copy)
     78         raise TypeError('A sparse matrix was passed, but dense data '
     79                         'is required. Use X.toarray() to convert to dense.')
---> 80     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
     81     _assert_all_finite(X_2d)
     82     if X is X_2d and copy:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    318 
    319     """
--> 320     return array(a, dtype, copy=False, order=order)
    321 
    322 def asanyarray(a, dtype=None, order=None):

TypeError: float() argument must be a string or a number

Recommended answer

The best way is to explode the date into a set of categorical features encoded in boolean form using the 1-of-K encoding (e.g. as done by DictVectorizer). Here are some features that can be extracted from a date:

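  • hour of the day (24 boolean features)
  • day of the week (7 boolean features)
  • day of the month (up to 31 boolean features)
  • month of the year (12 boolean features)
  • year (as many boolean features as there are different years in your dataset) ...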

That should make it possible to identify linear dependencies on periodic events in typical human life cycles.
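
As a rough illustration of the 1-of-K idea, here is a minimal sketch using DictVectorizer. The helper name date_to_categories and the sample dates are made up for this example. It assumes plain datetime.date values, so there is no hour-of-day feature, and it relies on DictVectorizer one-hot encoding string-valued entries (numeric values would be passed through unchanged, hence the str() calls).

```python
import datetime

from sklearn.feature_extraction import DictVectorizer


def date_to_categories(d):
    """Map a date to string-valued categorical features (hypothetical helper)."""
    return {
        "weekday": str(d.weekday()),  # 0 = Monday ... 6 = Sunday
        "day": str(d.day),
        "month": str(d.month),
        "year": str(d.year),
    }


# Illustrative dates only; in practice these come from the DataFrame's date column.
dates = [
    datetime.date(2013, 4, 1),
    datetime.date(2013, 4, 2),
    datetime.date(2014, 1, 15),
]

# String values are expanded into one boolean column per (feature, value) pair.
vec = DictVectorizer(sparse=False)
X_date = vec.fit_transform([date_to_categories(d) for d in dates])
```

The resulting array can be stacked next to the other numeric columns before calling fit. Note that transform silently ignores values not seen during fit, so a year that only appears in the test set gets no column.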

Additionally, you can also encode the date as a single float: convert each date to the number of days since the min date of your training set, then divide by the number of days between the min date and the max date. That numerical feature should make it possible to identify long-term trends in the output as a function of the event date, e.g. a linear slope in a regression problem, to better predict the evolution over forthcoming years, which cannot be encoded with the boolean categorical variables for the year feature.
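
The scaling described above might be sketched as follows. The function name scaled_days and the example dates are assumptions for illustration; the min and max dates must be taken from the training set only and reused unchanged for later data.

```python
import datetime


def scaled_days(dates, min_date, max_date):
    """Days since min_date, divided by the training span in days (hypothetical helper)."""
    span = (max_date - min_date).days
    if span == 0:
        raise ValueError("min_date and max_date must differ")
    return [(d - min_date).days / float(span) for d in dates]


# Illustrative training range: calendar year 2013 (364 days between the endpoints).
train_min = datetime.date(2013, 1, 1)
train_max = datetime.date(2013, 12, 31)

train_feature = scaled_days([train_min, train_max], train_min, train_max)
# A date after the training range maps to a value above 1.
future_feature = scaled_days([datetime.date(2014, 12, 30)], train_min, train_max)
```

Training dates land in [0, 1], while later dates extend beyond 1, which is what lets a linear model extrapolate a trend. Tree-based models such as RandomForestRegressor cannot extrapolate beyond the range seen in training, so this feature is mainly useful for linear regressors.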
