How do I avoid time leakage in my KNN model?

Question

I am building a KNN model to predict housing prices. I'll go through my data and my model and then my problem.

Data:

# A tibble: 81,334 x 4
   latitude longitude close_date          close_price
      <dbl>     <dbl> <dttm>                    <dbl>
 1     36.4     -98.7 2014-08-05 06:34:00     147504.
 2     36.6     -97.9 2014-08-12 23:48:00     137401.
 3     36.6     -97.9 2014-08-09 04:00:40     239105.

Model:

library(caret)
library(dplyr)  # provides the %>% pipe

# Random 80/20 split; note that it ignores close_date entirely
training.samples <- data$close_price %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]

# KNN regression tuned with ordinary 10-fold cross-validation
model <- train(
  close_price ~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)

My problem is time leakage. I am making predictions on a house using other houses that closed afterwards and in the real world I shouldn't have access to that information.

I want to apply a rule to the model that says, for each value y, only use houses that closed before the house for that y. I know I could split my test data and my train data on a certain date, but that doesn't quite do it.

Is it possible to prevent this time leakage, either in caret or other libraries for knn (like class and kknn)?

Recommended answer

In caret, createTimeSlices implements a variation of cross-validation adapted to time series (avoiding time leakage by rolling the forecasting origin). Documentation is here.
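To make the rolling-origin idea concrete, here is a minimal base-R sketch of expanding-window slices on six time-ordered rows. The helper rolling_slices is hypothetical and written only for illustration; it is not caret's implementation, just the same idea as a growing training window with a fixed test horizon.

```r
# Hypothetical helper (not part of caret): build expanding-window slices.
# Each slice trains on rows 1..origin and tests on the next `horizon` rows.
rolling_slices <- function(n, initial_window, horizon = 1) {
  origins <- seq(initial_window, n - horizon)
  list(
    train = lapply(origins, function(o) seq_len(o)),
    test  = lapply(origins, function(o) seq(o + 1, o + horizon))
  )
}

# Six rows sorted by time, first training window of three rows
slices <- rolling_slices(n = 6, initial_window = 3)

# Print each slice: the test row always comes after every training row
for (j in seq_along(slices$train)) {
  cat("train:", slices$train[[j]], "| test:", slices$test[[j]], "\n")
}
```

Because every test index is larger than every training index in its slice, a model evaluated this way never sees rows that come later in the sorted data. With one slice per row, a data set of 81,334 rows means a very large number of refits; createTimeSlices accepts a skip argument that can thin the slices if that is too slow.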

In your case, depending on your precise needs, you could use something like this for a proper cross-validation:

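library(caret)
library(dplyr)

# Sort chronologically so that row order matches time order
your_data <- your_data %>% arrange(close_date)

# Rolling-origin slices: growing training window, one-row test horizon
tr_ctrl <- createTimeSlices(
  your_data$close_price,
  initialWindow = 10,
  horizon = 1,
  fixedWindow = FALSE)

model <- train(
  close_price ~ ., data = your_data, method = "knn",
  # createTimeSlices returns lists of row indices, so wrap them in trainControl
  trControl = trainControl(index = tr_ctrl$train, indexOut = tr_ctrl$test),
  preProcess = c("center", "scale"),
  tuneLength = 10
)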

If you have ties in the dates and want to avoid having deals closed on the same day in both the test and train sets, you can fix tr_ctrl before using it in train:

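library(lubridate)  # for as_date()

# Keep only training rows that closed strictly before the earliest test date
filter_train <- function(i_tr, i_te) {
  d_tr <- as_date(your_data$close_date[i_tr])
  d_te <- as_date(your_data$close_date[i_te])
  tr_is_ok <- d_tr < min(d_te)

  i_tr[tr_is_ok]
}

# SIMPLIFY = FALSE keeps the result a list of index vectors
tr_ctrl$train <- mapply(filter_train, tr_ctrl$train, tr_ctrl$test, SIMPLIFY = FALSE)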
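
The rule from the question (predict each house only from houses that closed before it) can also be applied directly, outside caret. The following is a rough base-R sketch under some assumptions: knn_past_only is a hypothetical helper, distance is plain Euclidean distance on unscaled latitude and longitude, and the toy data frame is invented purely for illustration. It is not how caret, class or kknn work internally.

```r
# Hypothetical helper: for each row, average the prices of the k nearest
# houses (by latitude/longitude) that closed strictly earlier.
knn_past_only <- function(df, k = 5) {
  dates <- as.numeric(df$close_date)
  preds <- rep(NA_real_, nrow(df))
  for (i in seq_len(nrow(df))) {
    past <- which(dates < dates[i])   # only houses closed before row i
    if (length(past) < k) next        # not enough history: leave NA
    d2 <- (df$latitude[past] - df$latitude[i])^2 +
      (df$longitude[past] - df$longitude[i])^2
    nearest <- past[order(d2)[seq_len(k)]]
    preds[i] <- mean(df$close_price[nearest])
  }
  preds
}

# Toy data, invented for illustration only
toy <- data.frame(
  latitude    = c(36.4, 36.6, 36.6, 36.5, 36.4),
  longitude   = c(-98.7, -97.9, -97.9, -98.0, -98.6),
  close_date  = as.POSIXct(c("2014-08-01", "2014-08-02", "2014-08-03",
                             "2014-08-04", "2014-08-05"), tz = "UTC"),
  close_price = c(100, 200, 220, 210, 110)
)

preds <- knn_past_only(toy, k = 2)
print(preds)
```

The first two rows get NA because fewer than k earlier sales exist. The loop is quadratic in the number of rows, so on 81,334 houses it would likely need a spatial index or batching by date to be practical.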
