Regression with Date variable using Scikit-learn

Question

I have a Pandas DataFrame with a date column (e.g. 2013-04-01) of dtype datetime.date. When I include that column in X_train and try to fit the regression model, I get the error "float() argument must be a string or a number". Removing the date column avoids this error.

What is the proper way to take the date into account in the regression model?

Code

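import pandas.io.sql as sql  # assumed origin of `sql`; read_frame is the legacy pandas API
from sklearn.ensemble import RandomForestRegressor

data = sql.read_frame(...)
X_train = data.drop('y', axis=1)  # still contains the datetime.date column
y_train = data.y

rf = RandomForestRegressor().fit(X_train, y_train)  # raises the TypeError shown below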

Error

TypeError                                 Traceback (most recent call last)
<ipython-input-35-8bf6fc450402> in <module>()
----> 2 rf = RandomForestRegressor().fit(X_train, y_train)

C:\Python27\lib\site-packages\sklearn\ensemble\forest.pyc in fit(self, X, y, sample_weight)
    292                 X.ndim != 2 or
    293                 not X.flags.fortran):
--> 294             X = array2d(X, dtype=DTYPE, order="F")
    295 
    296         n_samples, self.n_features_ = X.shape

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in array2d(X, dtype, order, copy)
     78         raise TypeError('A sparse matrix was passed, but dense data '
     79                         'is required. Use X.toarray() to convert to dense.')
---> 80     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
     81     _assert_all_finite(X_2d)
     82     if X is X_2d and copy:

C:\Python27\lib\site-packages\numpy\core\numeric.pyc in asarray(a, dtype, order)
    318 
    319     """
--> 320     return array(a, dtype, copy=False, order=order)
    321 
    322 def asanyarray(a, dtype=None, order=None):

TypeError: float() argument must be a string or a number

Recommended answer

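The best way is to explode the date into a set of categorical features, each encoded in boolean form using 1-of-K (one-hot) encoding (e.g. as done by DictVectorizer, see http://scikit-learn.org/stable/modules/feature_extraction.html#loading-features-from-dicts). Here are some features that can be extracted from a date: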

  • hour of the day (24 boolean features)
  • day of the week (7 boolean features)
  • day of the month (up to 31 boolean features)
  • month of the year (12 boolean features)
  • year (as many boolean features as there are different years in your dataset) ...
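
The list above can be sketched in code as follows. This is only an illustration, not the answerer's own implementation: the helper date_to_features and the sample timestamps are hypothetical, and it assumes datetime values (for plain datetime.date values, which carry no time, the hour entry would be dropped). The values are converted to strings because DictVectorizer one-hot encodes string values but passes numeric values through unchanged.

```python
from datetime import datetime

from sklearn.feature_extraction import DictVectorizer


def date_to_features(d):
    """Map a datetime to string-valued categorical features (hypothetical helper)."""
    return {
        "hour": str(d.hour),
        "weekday": str(d.weekday()),  # 0 = Monday ... 6 = Sunday
        "day": str(d.day),
        "month": str(d.month),
        "year": str(d.year),
    }


# Hypothetical sample timestamps, for illustration only.
dates = [
    datetime(2013, 4, 1, 9),
    datetime(2013, 4, 2, 17),
    datetime(2014, 1, 15, 9),
]

# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix.
vectorizer = DictVectorizer(sparse=False)
X_dates = vectorizer.fit_transform([date_to_features(d) for d in dates])

# One boolean column per (feature, value) pair seen in the data,
# named like "month=4" or "year=2013".
column_names = sorted(vectorizer.vocabulary_)
```

Only values that occur in the fitted data get a column, so the matrix is usually narrower than the theoretical maximum of 24 + 7 + 31 + 12 columns plus one per year. The resulting array can be stacked with the other numeric columns before calling fit.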

That should make it possible to identify linear dependencies on periodic events in typical human life cycles.

Additionally, you can extract the date as a single float: convert each date to the number of days since the min date of your training set, then divide by the number of days between the max date and the min date. That numerical feature should make it possible to identify long-term trends in the output as a function of the event date, e.g. a linear slope in a regression problem. This helps predict the evolution in forthcoming years, which cannot be encoded with the boolean categorical variable for the year feature.
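
A minimal sketch of that scaling, using only the standard library. The helper name date_to_float and the sample dates are hypothetical, and the code assumes the training set contains at least two distinct dates so that the span is not zero.

```python
from datetime import date

# Hypothetical training dates, for illustration only.
train_dates = [date(2013, 4, 1), date(2013, 10, 1), date(2014, 4, 1)]

min_date = min(train_dates)
# Number of days between the latest and earliest training dates.
span_days = (max(train_dates) - min_date).days


def date_to_float(d):
    """Days since the earliest training date, divided by the training span."""
    return (d - min_date).days / float(span_days)


# Training dates map into the range [0.0, 1.0].
trend = [date_to_float(d) for d in train_dates]

# A date after the training range maps above 1.0, which is what lets a
# linear model extend the trend into later years.
future = date_to_float(date(2015, 4, 1))
```

The min date and the span are taken from the training set only and reused unchanged for any later dates, so training and prediction share the same scale.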
