How do I avoid time leakage in my KNN model?
Question
I am building a KNN model to predict housing prices. I'll go through my data and my model and then my problem.
Data:
# A tibble: 81,334 x 4
  latitude longitude close_date          close_price
     <dbl>     <dbl> <dttm>                    <dbl>
1     36.4     -98.7 2014-08-05 06:34:00     147504.
2     36.6     -97.9 2014-08-12 23:48:00     137401.
3     36.6     -97.9 2014-08-09 04:00:40     239105.
Model:
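library(caret)
library(dplyr)  # provides the %>% pipe used below

# Random 80/20 split, stratified on close_price (ignores close_date)
training.samples <- data$close_price %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]

# KNN regression tuned with ordinary 10-fold cross-validation
model <- train(
  close_price ~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)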
My problem is time leakage. I am making predictions on a house using other houses that closed afterwards and in the real world I shouldn't have access to that information.
I want to apply a rule to the model that says: for each value y, only use houses that closed before the house for that y. I know I could split my test data and my train data on a certain date, but that doesn't quite do it.
Is it possible to prevent this time leakage, either in caret or in other libraries for KNN (like class and kknn)?
Recommended answer
In caret, createTimeSlices implements a variation of cross-validation adapted to time series (avoiding time leakage by rolling the forecasting origin). See the caret documentation for createTimeSlices.
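To make the rolling origin concrete, here is a minimal sketch on a toy vector of six observations (the vector and the window sizes are illustrative assumptions, and the data is assumed to be sorted by time already):

```r
library(caret)

# Six observations, assumed to be already sorted by time.
y <- 1:6

# Growing training window (fixedWindow = FALSE), one-step-ahead test set.
slices <- createTimeSlices(y, initialWindow = 3, horizon = 1,
                           fixedWindow = FALSE)

# slices is a list with two elements, train and test, each a list of
# row indices with one entry per resample.
str(slices$train)
str(slices$test)
```

Here the training sets are rows 1 to 3, 1 to 4 and 1 to 5, and the matching test sets are rows 4, 5 and 6, so every test row comes after all of its training rows.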
In your case, depending on your precise needs, you could use something like this for a proper cross-validation:
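library(caret)
library(dplyr)

# Sort by closing date so that row order matches time order
your_data <- your_data %>% arrange(close_date)

# createTimeSlices returns a list of row indices (train and test),
# not a trainControl object
tr_ctrl <- createTimeSlices(
  your_data$close_price,
  initialWindow = 10,
  horizon = 1,
  fixedWindow = FALSE)

model <- train(
  close_price ~ ., data = your_data, method = "knn",
  # Wrap the slices in trainControl so train() uses them as resamples
  trControl = trainControl(index = tr_ctrl$train, indexOut = tr_ctrl$test),
  preProcess = c("center", "scale"),
  tuneLength = 10
)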
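As an alternative, trainControl can build the rolling-origin resamples itself through method = "timeslice". The sketch below only constructs the control object; the window sizes are arbitrary assumptions chosen to keep the number of resamples manageable on a data set of about 81,000 rows, not tuned values.

```r
library(caret)

# Window sizes below are illustrative assumptions, not tuned values.
ts_ctrl <- trainControl(
  method = "timeslice",
  initialWindow = 1000,  # rows in the first training window
  horizon = 100,         # rows predicted after each training window
  fixedWindow = FALSE,   # keep growing the training window
  skip = 99              # start a new slice every 100 rows, not every row
)

# ts_ctrl can then be passed to train() as trControl, provided the data
# is already sorted by close_date.
```

With skip = 0 a new slice starts at every row, which multiplies the number of KNN fits, so a larger skip is worth considering here.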
If you have ties in the dates and want to avoid having deals that closed on the same day in both the test and train sets, you can fix tr_ctrl before using it in train:
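library(lubridate)  # provides as_date()

# Keep only training rows that closed on a day strictly before
# the earliest test day of the same slice
filter_train <- function(i_tr, i_te) {
  d_tr <- as_date(your_data$close_date[i_tr])
  d_te <- as_date(your_data$close_date[i_te])
  tr_is_ok <- d_tr < min(d_te)
  i_tr[tr_is_ok]
}
# SIMPLIFY = FALSE keeps the result a list, one index vector per slice
tr_ctrl$train <- mapply(filter_train, tr_ctrl$train, tr_ctrl$test, SIMPLIFY = FALSE)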
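The slices above control how the model is validated. The rule from the question (for each house, use only houses that closed before it) can also be written directly. The following is a brute-force sketch in base R, not something caret, class or kknn provide: knn_past_only is a hypothetical helper, distances are plain Euclidean distances on unscaled latitude and longitude, and the toy rows are invented for illustration.

```r
# Hypothetical helper: KNN regression in which each prediction only uses
# rows that closed strictly before the row being predicted.
knn_past_only <- function(df, k = 5) {
  n <- nrow(df)
  preds <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    # Candidate neighbours: houses that closed before house i
    past <- which(df$close_date < df$close_date[i])
    if (length(past) < k) next  # not enough history yet, leave NA
    # Euclidean distance on raw coordinates (an assumption; no scaling)
    d <- sqrt((df$latitude[past] - df$latitude[i])^2 +
              (df$longitude[past] - df$longitude[i])^2)
    nearest <- past[order(d)[seq_len(k)]]
    preds[i] <- mean(df$close_price[nearest])
  }
  preds
}

# Toy data, invented for illustration only
toy <- data.frame(
  latitude    = c(36.4, 36.6, 36.6, 36.5),
  longitude   = c(-98.7, -97.9, -97.9, -98.0),
  close_date  = as.POSIXct(c("2014-08-05 06:34:00", "2014-08-09 04:00:40",
                             "2014-08-12 23:48:00", "2014-08-13 10:00:00"),
                           tz = "UTC"),
  close_price = c(147504, 239105, 137401, 150000)
)

preds <- knn_past_only(toy, k = 2)
print(preds)
```

The first two rows get NA because fewer than k houses closed before them. The loop is quadratic in the number of rows, so on the full data set it would need a spatial index or batching to be practical.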