Error "(subscript) logical subscript too long" with tune.svm from e1071 package in R
Problem description
I am trying to use SVM for a multi-class classification task.
I have a dataset called df, which I divided into a training and a test set with the following code:
library(dplyr) # provides %>% and arrange()
sample <- df[sample(nrow(df), 10000),] # take a random sample of 10,000 rows from dataset df
sample <- sample %>% arrange(Date) # arrange chronologically
train <- sample[1:8000,] # first 80% of the sample
test <- sample[8001:10000,] # remaining 20% of the sample
This is what the training set looks like:
> str(train)
'data.frame': 8000 obs. of 45 variables:
$ Date : Date, format: "2008-01-01" "2008-01-01" "2008-01-02" ...
$ Weekday : chr "Tuesday" "Tuesday" "Wednesday" "Wednesday" ...
$ Season : Factor w/ 4 levels "Winter","Spring",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Weekend : num 0 0 0 0 0 0 0 0 0 0 ...
$ Icao.type : Factor w/ 306 levels "A124","A225",..: 7 29 112 115 107 10 115 115 115 112 ...
$ Act.description : Factor w/ 389 levels "A300-600F","A330-200F",..: 9 29 161 162 150 13 162 162 162 161 ...
$ Arr.dep : Factor w/ 2 levels "A","D": 2 2 1 1 1 1 1 1 1 1 ...
$ MTOW : num 77 69 46 21 22 238 21 21 21 46 ...
$ Icao.wtc : chr "Medium" "Medium" "Medium" "Medium" ...
$ Wind.direc : int 104 104 82 82 93 93 93 132 132 132 ...
$ Wind.speed.vec : int 35 35 57 57 64 64 64 62 62 62 ...
$ Wind.speed.daily: int 35 35 58 58 65 65 65 63 63 63 ...
$ Wind.speed.max : int 60 60 70 70 80 80 80 90 90 90 ...
$ Wind.speed.min : int 20 20 40 40 50 50 50 50 50 50 ...
$ Wind.gust.max : int 100 100 120 120 130 130 130 140 140 140 ...
$ Temp.daily : int 24 24 -5 -5 4 4 4 34 34 34 ...
$ Temp.min : int -7 -7 -25 -25 -13 -13 -13 11 11 11 ...
$ Temp.max : int 50 50 16 16 13 13 13 55 55 55 ...
$ Temp.10.min : int -11 -11 -32 -32 -18 -18 -18 9 9 9 ...
$ Sun.dur : int 7 7 65 65 19 19 19 0 0 0 ...
$ Sun.dur.prct : int 9 9 83 83 24 24 24 0 0 0 ...
$ Radiation : int 173 173 390 390 213 213 213 108 108 108 ...
$ Precip.dur : int 0 0 0 0 0 0 0 5 5 5 ...
$ Precip.daily : int 0 0 0 0 -1 -1 -1 2 2 2 ...
$ Precip.max : int 0 0 0 0 -1 -1 -1 2 2 2 ...
$ Sea.press.daily : int 10259 10259 10206 10206 10080 10080 10080 10063 10063 10063 ...
$ Sea.press.max : int 10276 10276 10248 10248 10132 10132 10132 10086 10086 10086 ...
$ Sea.press.min : int 10250 10250 10141 10141 10058 10058 10058 10001 10001 10001 ...
$ Visibility.min : int 1 1 40 40 43 43 43 58 58 58 ...
$ Visibility.max : int 59 59 75 75 66 66 66 65 65 65 ...
$ Cloud.daily : int 7 7 3 3 8 8 8 8 8 8 ...
$ Humidity.daily : int 96 96 86 86 77 77 77 82 82 82 ...
$ Humidity.max : int 99 99 92 92 92 92 92 90 90 90 ...
$ Humidity.min : int 91 91 74 74 71 71 71 76 76 76 ...
$ Evapo : int 2 2 4 4 2 2 2 1 1 1 ...
$ Wind.discrete : chr "South East" "South East" "North East" "North East" ...
$ Vmc.imc : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
$ Beaufort : num 3 3 4 4 4 4 4 4 4 4 ...
$ Main.A : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.B : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.K : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.O : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.P : num 0 0 0 0 0 0 0 0 0 0 ...
$ Main.Z : num 0 0 0 0 0 0 0 0 0 0 ...
$ Runway : Factor w/ 13 levels "04","06","09",..: 3 8 2 2 2 6 2 6 6 6 ...
Then, I try to tune the SVM parameters with the following code:
library(e1071)
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
While this code has worked in the past, it now gives me the following error:
Error in newdata[, object$scaled, drop = FALSE] :
(subscript) logical subscript too long
The only thing I can think of that has changed is the set of rows in train, since running the first code block draws a new random sample of 10,000 rows from df, which contains 3.5 million rows.
Does anyone know why I am getting this?
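Because the sample is redrawn on every run, a failing draw cannot be re-examined later. A minimal sketch of making the draw repeatable with set.seed follows. The data frame df_demo, the seed value 42 and the smaller sizes are assumptions made for illustration and are not part of the original code.

```r
# Hypothetical stand-in for df (the real dataset has 3.5 million rows).
df_demo <- data.frame(
  id   = 1:1000,
  Date = as.Date("2008-01-01") + (0:999) %% 365
)

# Fixing the seed before sampling makes the draw repeatable.
set.seed(42)                        # arbitrary seed, chosen for illustration
idx1 <- sample(nrow(df_demo), 100)
set.seed(42)
idx2 <- sample(nrow(df_demo), 100)
same_draw <- identical(idx1, idx2)  # both draws select the same rows

# Same 80/20 chronological split as in the question, on the smaller sample.
smp        <- df_demo[idx1, ]
smp        <- smp[order(smp$Date), ]
train_demo <- smp[1:80, ]
test_demo  <- smp[81:100, ]
```

With a fixed seed, a sample that triggers the error can be regenerated and inspected column by column.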
Recommended answer
I recognise that this question was rather hard to solve without a good reproducible example.
However, I have found the solution to my problem and am posting it here for anyone who runs into the same error in the future.
Running the same code, but with selected columns from the train set:
tuned <- tune.svm(Runway ~ ., data = train[,c(1:2, 45)], gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
gave me no problem at all. I kept adding features until the error reappeared, and found that the features Vmc.imc and Icao.wtc were causing it; both are chr columns. Using the following code:
train$Vmc.imc <- as.factor(train$Vmc.imc)
train$Icao.wtc <- as.factor(train$Icao.wtc)
to change them into factors and then rerunning
tuned <- tune.svm(Runway ~ ., data = train, gamma = 10 ^ (-6:-1), cost = 10 ^ (-1:1))
solved my problem.
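The same conversion can be applied to every chr column at once instead of naming each one. The following is a minimal sketch in base R; the data frame toy and its values are made up for illustration and only mimic a few columns of train.

```r
# Hypothetical miniature of train; the values are invented for illustration.
toy <- data.frame(
  Weekday = c("Tuesday", "Wednesday", "Tuesday"),
  Vmc.imc = c("Unknown", "VMC", "Unknown"),
  MTOW    = c(77, 69, 46),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert each of them to a factor.
is_chr      <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)  # Weekday and Vmc.imc are now factors, MTOW is still numeric
```

Converting before the data is split or resampled means every subset carries the same set of factor levels.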
I do not know why the other chr features, such as Weekday and Wind.discrete, do not cause the same issue. If anyone knows the answer, I would be glad to hear it.
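One possible explanation, offered as an assumption and not confirmed for this dataset: when a chr column goes through a formula, R turns it into a factor using only the values present in the rows it is given. If tuning fits on one subset and predicts on another, a rare value missing from one subset would give a different number of dummy columns, whereas a column like Weekday, whose values all appear in every subset, would not. The base R sketch below shows only that column-count mismatch on made-up data; it does not call tune.svm.

```r
# Made-up data: the value "c" occurs in only one of the two subsets.
d <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  x = c("a", "b", "c", "a", "b", "a"),
  stringsAsFactors = FALSE
)
fold_a <- d[1:3, ]  # x contains a, b, c
fold_b <- d[4:6, ]  # x contains a, b only

# As chr, each subset gets its own levels, so the design matrices differ.
n_chr_a <- ncol(model.matrix(y ~ x, fold_a))
n_chr_b <- ncol(model.matrix(y ~ x, fold_b))

# Converted to a factor up front, both subsets share the levels a, b, c.
d$x <- factor(d$x)
n_fac_a <- ncol(model.matrix(y ~ x, d[1:3, ]))
n_fac_b <- ncol(model.matrix(y ~ x, d[4:6, ]))

c(n_chr_a, n_chr_b, n_fac_a, n_fac_b)
```

If this is what happens here, tabulating the values of Vmc.imc and Icao.wtc in train with table() should reveal at least one rare value, while Weekday and Wind.discrete should show none.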