Scaling production data


Question

I have a dataset, say Data, which consists of categorical and numerical variables. After cleaning it, I scaled only the numerical variables (I assume categorical variables must not be scaled) using

Data <- Data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

Then I split the data into training and test sets using

set.seed(123)
sample_size = floor(0.70*nrow(Data))
xyz <- sample(seq_len(nrow(Data)),size = sample_size)
Train_Set <- Data[xyz,]
Test_Set <- Data[-xyz,]

I have built a classification model using ranger, say model_rang, trained on Train_Set and tested on Test_Set.

If new data, say new_data, arrives in production, is it enough, after cleaning it, to scale it the same way as above? I mean

new_data <- new_data %>% dplyr::mutate_if(is.numeric, ~scale(.) %>% as.vector)

and then use it to predict the outcome (there are two classes, 0 and 1, and 1 is the class of interest) using

probabilities <- as.data.frame(predict(model_rang, data = new_data, num.trees = 5000, type='response', verbose = TRUE)$predictions)
caret::confusionMatrix(table(max.col(probabilities) - 1,new_data$Class), positive='1')

Is the scaling done properly, as it was for Data, or am I missing something crucial for the production data?

Or must I scale Train_Set separately, use the mean and standard deviation of each of its variables to scale Test_Set, and, when new data arrives in production, apply those same Train_Set means and standard deviations to every new data set?

Recommended answer

When you scale data, you subtract the mean and divide by the standard deviation. The mean and standard deviation of your new data might not be the same as those of the training data used to build your model.

Imagine that in your random forest one variable was split at 0.555 (on the scaled data). If the standard deviation in your new data is lower, values that would have fallen below 0.555 under the training scaling can now land above it, and those observations may end up classified into a different class.
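To make that concrete, here is a small sketch with made-up numbers. The vectors train and new_batch and the threshold are hypothetical, chosen only for illustration; they do not come from the asker's data or model.

```r
# Hypothetical training values for one numeric variable
train <- c(10, 12, 14, 16, 18)
train_mean <- mean(train)   # 14
train_sd <- sd(train)       # about 3.16

# Hypothetical production batch: same mean, smaller spread
new_batch <- c(13, 13.5, 14, 14.5, 15)

# Scaling with the stored training parameters
scaled_with_train <- (new_batch - train_mean) / train_sd

# Scaling the batch with its own mean and standard deviation
scaled_with_own <- as.vector(scale(new_batch))

# Assumed split point learned on the scaled training data
threshold <- 0.555

# Count how many observations end up above the split under each scaling
sum(scaled_with_train > threshold)
sum(scaled_with_own > threshold)
```

With the training parameters, no value in the batch crosses the split; when the batch is re-scaled on its own, the two largest values do, so the same raw observations would be routed down a different branch.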

One thing you can do is store the scaling attributes, as in the post you pointed to:

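set.seed(111)

# Example data used to fit the scaling parameters
data = data.frame(A=sample(letters[1:3],100,replace=TRUE),
B=runif(100),C=rnorm(100))

# Names of the numeric columns only
num_cols = names(which(sapply(data,is.numeric)))

# Keep the column means and standard deviations computed by scale()
scale_params = attributes(scale(data[,num_cols]))[c("scaled:center","scaled:scale")]

# New data arriving later
newdata = data.frame(A=sample(letters[1:3],100,replace=TRUE),
B=runif(100),C=rnorm(100))

# Scale the new data with the stored parameters, not its own
newdata[,num_cols] = scale(newdata[,num_cols],
center=scale_params[[1]],scale=scale_params[[2]])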
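The same idea can be applied to the train/test workflow from the question. The following is a minimal sketch, not the asker's actual pipeline: the synthetic Data frame and the helper apply_scaling are hypothetical stand-ins. It assumes the split is done first, the means and standard deviations are computed from Train_Set only, and those values are then reused for Test_Set and any later production data.

```r
set.seed(123)

# Hypothetical stand-in for the cleaned data set
Data <- data.frame(Class = factor(sample(0:1, 200, replace = TRUE)),
                   A = sample(letters[1:3], 200, replace = TRUE),
                   B = runif(200),
                   C = rnorm(200, mean = 5, sd = 2))

# Split first, before any scaling
idx <- sample(seq_len(nrow(Data)), size = floor(0.70 * nrow(Data)))
Train_Set <- Data[idx, ]
Test_Set <- Data[-idx, ]

# Scaling parameters come from the training rows only
num_cols <- names(which(sapply(Train_Set, is.numeric)))
center <- colMeans(Train_Set[, num_cols])
spread <- sapply(Train_Set[, num_cols], sd)

# Hypothetical helper: scale any data frame with the stored training parameters
apply_scaling <- function(df) {
  df[, num_cols] <- scale(df[, num_cols], center = center, scale = spread)
  df
}

Train_Set <- apply_scaling(Train_Set)
Test_Set <- apply_scaling(Test_Set)

# Later, production data would go through the same helper, e.g.
# new_data <- apply_scaling(new_data)
```

Saving center and spread alongside the model (for example with saveRDS) would keep the production scaling tied to the parameters the model was trained with.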
