Unexpected output while using 'neuralnet' in R
Problem description
I'm using the neuralnet package in R to predict handwritten digits. The MNIST database is used for training and testing. Here is the R code I used:
# Importing the data into R
path <- "path_to_data_folder/MNIST_database_of_handwritten_digits/" # Data can be downloaded from: http://yann.lecun.com/exdb/mnist/
to.read = file(paste0(path, "train-images-idx3-ubyte"), "rb")
to.read_Label = file(paste0(path, "train-labels-idx1-ubyte"), "rb")
magicNumber <- readBin(to.read, integer(), n=1, endian="big")
magicNumber_Label <- readBin(to.read_Label, integer(), n=1, endian="big")
numberOfImages <- readBin(to.read, integer(), n=1, endian="big")
numberOfImages_Label <- readBin(to.read_Label, integer(), n=1, endian="big")
rowPixels <- readBin(to.read, integer(), n=1, endian="big")
columnPixels <- readBin(to.read, integer(), n=1, endian="big")
# image(1:rowPixels, 1:columnPixels, matrix(readBin(to.read, integer(), n=(rowPixels*columnPixels), size=1, endian="big"), rowPixels, columnPixels)[,columnPixels:1], col=gray((0:255)/255))
trainDigits <- NULL
trainDigits <- vector(mode="list", length=numberOfImages)
for(i in 1:numberOfImages) {
  trainDigits[[i]] <- as.vector(matrix(readBin(to.read, integer(), n=(rowPixels*columnPixels), size=1, endian="big"), rowPixels, columnPixels)[,columnPixels:1])
}
trainDigits <- t(data.frame(trainDigits)) # Takes a minute
trainDigits <- data.frame(trainDigits, row.names=NULL)
# i <- 1 # Specify the image number to visualize the image
# image(1:rowPixels, 1:columnPixels, matrix(trainDigits[i,], rowPixels, columnPixels), col=gray((0:255)/255))
trainDigits_Label <- NULL
for(i in 1:numberOfImages_Label) {
  trainDigits_Label <- c(trainDigits_Label, readBin(to.read_Label, integer(), n=1, size=1, endian="big"))
}
# appending the labels to the training data
trainDigits <- cbind(trainDigits, trainDigits_Label)
#################### Modelling ####################
library(neuralnet)
# Considering only 500 rows for training due to time and memory constraints
myNnet <- neuralnet(formula = as.formula(paste0("trainDigits_Label ~ ", paste0("X",1:(ncol(trainDigits)-1), collapse="+"))),
                    data = trainDigits[1:500,], hidden = 10, algorithm='rprop+', learningrate=0.01)
#################### Test Data ####################
to.read_test = file(paste0(path, "t10k-images-idx3-ubyte"), "rb")
to.read_Label_test = file(paste0(path, "t10k-labels-idx1-ubyte"), "rb")
magicNumber <- readBin(to.read_test, integer(), n=1, endian="big")
magicNumber_Label <- readBin(to.read_Label_test, integer(), n=1, endian="big")
numberOfImages_test <- readBin(to.read_test, integer(), n=1, endian="big")
numberOfImages_Label_test <- readBin(to.read_Label_test, integer(), n=1, endian="big")
rowPixels <- readBin(to.read_test, integer(), n=1, endian="big")
columnPixels <- readBin(to.read_test, integer(), n=1, endian="big")
testDigits <- NULL
testDigits <- vector(mode="list", length=numberOfImages_test)
for(i in 1:numberOfImages_test) {
  testDigits[[i]] <- as.vector(matrix(readBin(to.read_test, integer(), n=(rowPixels*columnPixels), size=1, endian="big"), rowPixels, columnPixels)[,columnPixels:1])
}
testDigits <- t(data.frame(testDigits)) # Takes a minute
testDigits <- data.frame(testDigits, row.names=NULL)
testDigits_Label <- NULL
for(i in 1:numberOfImages_Label_test) {
  testDigits_Label <- c(testDigits_Label, readBin(to.read_Label_test, integer(), n=1, size=1, endian="big"))
}
#################### 'neuralnet' Predictions ####################
predictOut <- compute(myNnet, testDigits)
table(round(predictOut$net.result), testDigits_Label)
#################### Random Forest ####################
# Cross-validating NN results with Random Forest
library(randomForest)
myRF <- randomForest(x=trainDigits[,-ncol(trainDigits)], y=as.factor(trainDigits_Label), ntree=100)
predRF <- predict(myRF, newdata=testDigits)
table(predRF, testDigits_Label) # Confusion Matrix
sum(diag(table(predRF, testDigits_Label)))/sum(table(predRF, testDigits_Label)) # % of correct predictions
There are 60,000 training images (28*28 pixels each), and the digits 0 to 9 are distributed almost equally across the dataset. Unlike the 'modelling' part above, where I used only 500 images, I used the entire training dataset to train the myNnet model (28*28 = 784 inputs and 10 outputs) and then predicted the output for the 10,000 images in the test dataset. (I used only 10 neurons in the hidden layer due to memory constraints.)
The results I obtained are weird: the predictions follow a roughly Gaussian distribution in which 4 is predicted most often, and the number of predictions falls off roughly exponentially from 4 towards 0 and 9. You can see the confusion matrix below (I rounded the outputs since they were not integers):
> table(round(predictOut$net.result), testDigits_Label)
    testDigits_Label
       0   1   2   3   4   5   6   7   8   9
  -2   1   1   4   1   1   3   0   4   1   2
  -1   8  17  12   9   7   8   8  12   7  10
  0   38  50  44  45  35  28  36  40  30  39
  1   77 105  86  80  71  69  68  75  67  77
  2  116 163 126 129 101  97 111 101  99 117
  3  159 205 196 174 142 140 153 159 168 130
  4  216 223 212 183 178 170 177 169 181 196
  5  159 188 150 183 183 157 174 176 172 155
  6  119 111 129 125 143 124 144 147 129 149
  7   59  53  52  60  74  52  51  91  76  77
  8   22  14  18  14  32  36  28  38  35  41
  9    6   5   3   7  15   8   8  16   9  16
I thought there must be something wrong with my approach, so I tried prediction using the randomForest package in R. But randomForest worked fine, giving an accuracy of more than 95%. Here is the confusion matrix of the randomForest predictions:
> table(predRF, testDigits_Label)
      testDigits_Label
predRF    0    1    2    3    4    5    6    7    8    9
     0  967    0    6    1    1    7   11    2    5    5
     1    0 1123    0    0    0    1    3    7    0    5
     2    1    2  974    9    3    1    3   25    4    2
     3    0    3    5  963    0   21    0    0    9   10
     4    0    0   12    0  940    1    4    2    7   15
     5    4    0    2   16    0  832    6    0   11    4
     6    6    5    5    0    7   11  929    0    3    2
     7    1    1   14    7    2    2    0  979    4    6
     8    1    1   12    7    5   11    2    1  917   10
     9    0    0    2    7   24    5    0   12   14  950
Question 1: Can anyone explain why neuralnet behaves so strangely with this dataset? (By the way, neuralnet worked fine with the iris dataset when I checked.)
- I think I understand the reason for the Gaussian-like distribution in the output when neuralnet is used. There is only one output node (or is it neuron?) instead of one node per output class (10 classes here). So, while calculating the delta for back-propagation, the algorithm computes the difference between the 'expected output' and the 'calculated output', and summed over all instances this difference is smallest when the output is 4 or 5. The weights are therefore adjusted during back-propagation so that the output error is minimized. This might be the reason for the Gaussian-like output given by neuralnet.
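The intuition in the note above can be checked with a small sketch. It is only an illustration, not the network itself: it assumes hypothetical labels in which every digit appears equally often, and asks which single constant prediction gives the smallest sum of squared errors.

```r
# Hypothetical labels: each digit 0-9 appears 100 times (an assumption for illustration)
labels <- rep(0:9, each = 100)

# Sum of squared errors for a model that always predicts the same constant
sse <- function(constant) sum((labels - constant)^2)

# Try constants from 0 to 9 in steps of 0.5 and keep the one with the lowest error
candidates <- seq(0, 9, by = 0.5)
errors <- sapply(candidates, sse)
best <- candidates[which.min(errors)]

best          # 4.5
mean(labels)  # 4.5
```

The best constant is the mean of the labels, which is consistent with a single-output model that has learned little drifting towards 4 or 5.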
Question 2: I would also like to know how to rectify this behaviour of neuralnet and get predictions on par with the randomForest results.
Recommended answer
Some preliminary advice: you can load your data a little more efficiently like this:
# Read in data.
trainDigits <- replicate(numberOfImages, c(matrix(readBin(to.read, integer(), n=(rowPixels*columnPixels), size=1, endian="big"), rowPixels, columnPixels)[,columnPixels:1]))
trainDigits <- data.frame(t(trainDigits), row.names=NULL)
trainDigits_Label <- replicate(numberOfImages, readBin(to.read_Label, integer(), n=1, size=1, endian="big"))
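One detail of the loading code may be worth checking, although it is a side note and probably not the cause of the behaviour in the question: readBin reads 1-byte integers as signed by default, so pixel intensities above 127 would come back negative. A minimal sketch on a hypothetical raw vector (not the MNIST files) shows what the signed argument changes:

```r
# Four hypothetical pixel bytes: 0, 127, 128 and 255
bytes <- as.raw(c(0, 127, 128, 255))

# Default behaviour: 1-byte integers are interpreted as signed
signed_vals <- readBin(bytes, integer(), n = 4, size = 1)

# With signed = FALSE the same bytes are read as 0..255
unsigned_vals <- readBin(bytes, integer(), n = 4, size = 1, signed = FALSE)

signed_vals    # 0 127 -128 -1
unsigned_vals  # 0 127 128 255
```

If the 0 to 255 range matters, for example when plotting with gray((0:255)/255), passing signed=FALSE to the pixel reads is one option. The accuracies quoted in this answer were obtained with the loading code as shown.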
Your first problem is that you have not specified a multiclass prediction to neuralnet. What you were doing was predicting a real number from 0 to 9. That is why there was only one output instead of 10 predictions.
If you look in ?neuralnet there is an example of a multiclass prediction; you must put each class in a separate variable and put those variables on the left side of the formula. Other packages, like nnet, will automatically detect a factor and do this for you. You can use the class.ind function from the nnet package to split a factor into multiple indicator variables:
# appending the labels to the training data
library(nnet) # provides class.ind
output <- class.ind(trainDigits_Label)
colnames(output) <- paste0('out.', colnames(output))
output.names <- colnames(output)
input.names <- colnames(trainDigits)
trainDigits <- cbind(output, trainDigits)
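To see what class.ind produces, here is a minimal sketch on a hypothetical four-element label vector (not the MNIST labels). It assumes the nnet package is installed, which is normally the case since it ships with standard R distributions.

```r
library(nnet)  # assumed to be available; provides class.ind

# Hypothetical labels for illustration only
labels <- c(3, 0, 9, 3)

# One indicator column per distinct label, one row per observation
onehot <- class.ind(labels)
colnames(onehot) <- paste0("out.", colnames(onehot))

onehot
rowSums(onehot)  # every row contains exactly one 1
```

Note that class.ind only creates columns for labels that actually occur, so a training subset needs to contain all ten digits to get ten output columns.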
Now you can paste together a formula:
# Considering only 500 rows
trainsize=500
# neuralnet:::varify.variables (sic) does not pass "data" when calling "terms".
# If it did, you wouldn't have to construct the formula like this.
library(neuralnet)
myNnet <- neuralnet(formula = as.formula(paste(paste(output.names,collapse='+'),'~', paste(input.names,collapse='+'))),
                    data = trainDigits[1:trainsize,], hidden = 10, algorithm='rprop+', learningrate=0.01, rep=1)
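The formula construction can be tried on its own with a handful of made-up column names. The names below are hypothetical stand-ins for the real out.0 to out.9 and X1 to X784 columns; the sketch only shows how the pasted string turns into a multi-response formula.

```r
# Hypothetical column names, much shorter than the real ones
output.names <- paste0("out.", 0:2)
input.names  <- paste0("X", 1:4)

# Outputs joined by '+' on the left, inputs joined by '+' on the right
f <- as.formula(paste(paste(output.names, collapse = "+"), "~",
                      paste(input.names, collapse = "+")))

f
all.vars(f)  # outputs first, then inputs
```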
The correction still doesn't make the neural network perform well. To get an idea of how badly the neural network is doing, look at how it performs on the training data. It should be pretty good, because it has seen all this data before:
# Accuracy on training data
res <- compute(myNnet, trainDigits[1:trainsize, input.names])
picks <- (0:9)[apply(res$net.result, 1, which.max)]
prop.table(table(trainDigits_Label[1:trainsize] == picks))
# FALSE  TRUE
# 0.376 0.624
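The step that turns the ten network outputs into a single predicted digit is the apply(..., 1, which.max) line. A minimal sketch with a hypothetical 3-class score matrix (not real network output) shows the idea:

```r
# Hypothetical scores: one row per observation, one column per class
scores <- rbind(c(0.10, 0.70, 0.20),
                c(0.90, 0.05, 0.05),
                c(0.20, 0.30, 0.50))
classes <- 0:2

# For each row, pick the class whose column holds the largest score
picks <- classes[apply(scores, 1, which.max)]

# Hypothetical true labels, used only to illustrate the accuracy calculation
truth <- c(1, 0, 1)

picks                 # 1 0 2
mean(picks == truth)  # two of three correct
```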
An accuracy of 62% is terrible on training data. As you might expect, it performs barely above random on the rest of the data:
# Accuracy on held-out data (the training rows not used for fitting)
res <- compute(myNnet, trainDigits[(trainsize+1):60000, input.names])
picks <- (0:9)[apply(res$net.result, 1, which.max)]
prop.table(table(trainDigits_Label[(trainsize+1):60000] == picks))
#        FALSE         TRUE
# 0.8612268908 0.1387731092
# 14% accuracy
Random forest does amazingly well with the exact same data. There is a good reason why it has become so popular lately.
trainsize=500
library(randomForest)
myRF <- randomForest(trainDigits_Label~., data=data.frame(trainDigits_Label=as.factor(trainDigits_Label), trainDigits[input.names])[1:trainsize,], ntree=100)
# Train (predict() without newdata returns the out-of-bag predictions)
p <- as.numeric(as.character(predict(myRF)))
prop.table(table(trainDigits_Label[1:trainsize]==p))
# Accuracy: 79%
# Test (the training rows not used for fitting)
p <- as.numeric(as.character(predict(myRF, trainDigits[(trainsize+1):60000,])))
prop.table(table(trainDigits_Label[(trainsize+1):60000]==p))
# Accuracy: 76%
So, for your second question, my counter-question is: why would you expect the neural network to do as well as random forest? They might have some vague structural similarities, but the fitting process is quite different. I guess you could pore over the nodes in the neural network and compare them to the most important variables in the random forest model. But at this point it is more of a statistical question than a programming one.