How to preProcess features when some of them are factors?

Question

My question is related to an earlier one about categorical data (factors, in R terms) when using the caret package. I understand from the linked post that if you use the "formula interface", some features can be factors and the training will work fine. My question is: how can I scale the data with the preProcess() function? If I try to do it on a data frame in which some columns are factors, I get this error message:

Error in preProcess.default(etitanic, method = c("center", "scale")) : 
  all columns of x must be numeric

Here is some sample code:

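library(caret)  # provides preProcess()
library(earth)  # provides the etitanic data set
data(etitanic)

# This call raises the error above because pclass and sex are factors
a <- preProcess(etitanic, method = c("center", "scale"))
# predict() takes the preProcess object first, then the data
b <- predict(a, etitanic)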

Thanks.

Recommended answer

It is really the same issue as the post you link to. preProcess works only on numeric data and you have:

> str(etitanic)
'data.frame':   1046 obs. of  6 variables:
 $ pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ survived: int  1 1 0 0 0 1 1 0 1 0 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ age     : num  29 0.917 2 30 25 ...
 $ sibsp   : int  0 1 1 1 1 0 1 0 2 0 ...
 $ parch   : int  0 2 2 2 2 0 0 0 0 0 ...

You can't center and scale pclass or sex as-is, so they need to be converted to dummy variables. You can use model.matrix or caret's dummyVars to do this:

 > new <- model.matrix(survived ~ . - 1, data = etitanic)
 > colnames(new)
 [1] "pclass1st" "pclass2nd" "pclass3rd" "sexmale"   "age"      
 [6] "sibsp"     "parch"  

The -1 gets rid of the intercept. Now you can run preProcess on this object.
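To make that last step concrete, here is a minimal sketch of running preProcess on the dummy-coded matrix. It assumes the caret and earth packages are installed; the names x, pp and x_scaled are only illustrative and do not come from the original answer.

```r
library(caret)  # preProcess()
library(earth)  # etitanic data set
data(etitanic)

# Expand the factors into dummy columns; "- 1" drops the intercept
x <- model.matrix(survived ~ . - 1, data = etitanic)

# Estimate centering and scaling parameters from the all-numeric matrix
pp <- preProcess(x, method = c("center", "scale"))

# Apply them; the preProcess object is the first argument to predict()
x_scaled <- predict(pp, x)

# Each column should now have mean close to 0 and standard deviation close to 1
round(colMeans(x_scaled), 3)
round(apply(x_scaled, 2, sd), 3)
```

Note that this centers and scales the dummy columns as well as age, sibsp and parch, which may or may not be what you want for a given model.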

顺便说一句,使 preProcess 忽略非数字数据在我的待办事项"列表中,但它可能会导致人们不注意的错误.

By the way, making preProcess ignore non-numeric data is on my "to do" list, but it might cause errors for people who are not paying attention.
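Until then, a similar effect can be had by hand. The following is a rough sketch, not a description of how caret itself behaves: it scales only the numeric predictors and leaves the factors untouched. The names predictors, num_cols, pp_num and scaled are hypothetical, and treating survived as the outcome (so it is excluded from scaling) is an assumption based on the formula used above.

```r
library(caret)  # preProcess()
library(earth)  # etitanic data set
data(etitanic)

# Numeric predictor columns; the outcome `survived` is deliberately left out
predictors <- setdiff(names(etitanic), "survived")
num_cols <- predictors[sapply(etitanic[predictors], is.numeric)]

# Fit the transformation on the numeric predictors only
pp_num <- preProcess(etitanic[, num_cols], method = c("center", "scale"))

# Overwrite just those columns; pclass and sex remain factors
scaled <- etitanic
scaled[, num_cols] <- predict(pp_num, etitanic[, num_cols])

str(scaled)
```

The resulting data frame can still be passed to a formula interface, which would expand the factors itself.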

Max
