Unexpected output while using 'neuralnet' in R


Question

I'm using the neuralnet package in R to predict handwritten digits. The MNIST database is used for training and testing the algorithm. Here is the R code I used:

# Importing the data into R
path <- "path_to_data_folder/MNIST_database_of_handwritten_digits/"  # Data can be downloaded from: http://yann.lecun.com/exdb/mnist/
to.read = file(paste0(path, "train-images-idx3-ubyte"), "rb")
to.read_Label = file(paste0(path, "train-labels-idx1-ubyte"), "rb")
magicNumber <- readBin(to.read, integer(), n=1, endian="big")
magicNumber_Label <- readBin(to.read_Label, integer(), n=1, endian="big")
numberOfImages <- readBin(to.read, integer(), n=1, endian="big")
numberOfImages_Label <- readBin(to.read_Label, integer(), n=1, endian="big")
rowPixels <- readBin(to.read, integer(), n=1, endian="big")
columnPixels <- readBin(to.read, integer(), n=1, endian="big")

# image(1:rowPixels, 1:columnPixels, matrix(readBin(to.read, integer(), n=(rowPixels*columnPixels), size=1, endian="big"), rowPixels, columnPixels)[,columnPixels:1], col=gray((0:255)/255))

trainDigits <- NULL
trainDigits <- vector(mode="list", length=numberOfImages)
for(i in 1:numberOfImages)
  trainDigits[[i]] <- as.vector(matrix(readBin(to.read, integer(), n=(rowPixels*columnPixels), size=1, endian="big"), rowPixels, columnPixels)[,columnPixels:1])

trainDigits <- t(data.frame(trainDigits))  # Takes a minute
trainDigits <- data.frame(trainDigits, row.names=NULL)

# i <- 1  # Specify the image number to visualize the image
# image(1:rowPixels, 1:columnPixels, matrix(trainDigits[i,], rowPixels, columnPixels), col=gray((0:255)/255))

trainDigits_Label <- NULL
for(i in 1:numberOfImages_Label)
  trainDigits_Label <- c(trainDigits_Label, readBin(to.read_Label, integer(), n=1, size=1, endian="big"))

# appending the labels to the training data
trainDigits <- cbind(trainDigits, trainDigits_Label)

#################### Modelling ####################

library(neuralnet)
# Considering only 500 rows for training due to time and memory constraints
myNnet <- neuralnet(formula = as.formula(paste0("trainDigits_Label ~ ", paste0("X",1:(ncol(trainDigits)-1), collapse="+"))),
                                data = trainDigits[1:500,], hidden = 10, algorithm='rprop+', learningrate=0.01)

#################### Test Data ####################

to.read_test = file(paste0(path, "t10k-images-idx3-ubyte"), "rb")
to.read_Label_test = file(paste0(path, "t10k-labels-idx1-ubyte"), "rb")
magicNumber <- readBin(to.read_test, integer(), n=1, endian="big")
magicNumber_Label <- readBin(to.read_Label_test, integer(), n=1, endian="big")
numberOfImages_test <- readBin(to.read_test, integer(), n=1, endian="big")
numberOfImages_Label_test <- readBin(to.read_Label_test, integer(), n=1, endian="big")
rowPixels <- readBin(to.read_test, integer(), n=1, endian="big")
columnPixels <- readBin(to.read_test, integer(), n=1, endian="big")

testDigits <- NULL
testDigits <- vector(mode="list", length=numberOfImages_test)
for(i in 1:numberOfImages_test)
  testDigits[[i]] <- as.vector(matrix(readBin(to.read_test, integer(), n=(rowPixels*columnPixels), size=1, endian="big"), rowPixels, columnPixels)[,columnPixels:1])

testDigits <- t(data.frame(testDigits))  # Takes a minute
testDigits <- data.frame(testDigits, row.names=NULL)

testDigits_Label <- NULL
for(i in 1:numberOfImages_Label_test)
  testDigits_Label <- c(testDigits_Label, readBin(to.read_Label_test, integer(), n=1, size=1, endian="big"))

#################### 'neuralnet' Predictions ####################

predictOut <- compute(myNnet, testDigits)
table(round(predictOut$net.result), testDigits_Label)

#################### Random Forest ####################
# Cross-validating NN results with Random Forest

library(randomForest)
myRF <- randomForest(x=trainDigits[,-ncol(trainDigits)], y=as.factor(trainDigits_Label), ntree=100)

predRF <- predict(myRF, newdata=testDigits)
table(predRF, testDigits_Label)  # Confusion Matrix
sum(diag(table(predRF, testDigits_Label)))/sum(table(predRF, testDigits_Label))  # % of correct predictions

There are 60,000 training images (28*28 pixel images), and the digits 0 to 9 are distributed (almost) equally across the entire dataset. Unlike the 'Modelling' part above, where I used only 500 images, I trained the myNnet model (28*28 = 784 inputs and 10 outputs) on the entire training dataset and then predicted the output for the 10,000 images in the test dataset. (I used only 10 neurons in the hidden layer due to memory constraints.)

The results I obtained from the prediction are weird: the output looks like a Gaussian distribution, where 4 is predicted most of the time and the predictions fall off (roughly exponentially) from 4 towards 0 or 9. You can see the confusion matrix below (I rounded off the outputs since they were not integers):

> table(round(predictOut$net.result), testDigits_Label)
    testDigits_Label
        0    1    2    3    4    5    6    7    8    9
  -2    1    1    4    1    1    3    0    4    1    2
  -1    8   17   12    9    7    8    8   12    7   10
   0   38   50   44   45   35   28   36   40   30   39
   1   77  105   86   80   71   69   68   75   67   77
   2  116  163  126  129  101   97  111  101   99  117
   3  159  205  196  174  142  140  153  159  168  130
   4  216  223  212  183  178  170  177  169  181  196
   5  159  188  150  183  183  157  174  176  172  155
   6  119  111  129  125  143  124  144  147  129  149
   7   59   53   52   60   74   52   51   91   76   77
   8   22   14   18   14   32   36   28   38   35   41
   9    6    5    3    7   15    8    8   16    9   16

I thought there must be something wrong with my approach, so I tried predicting with R's randomForest package instead. randomForest worked fine, giving an accuracy of more than 95%. Here is the confusion matrix of the randomForest predictions:

> table(predRF, testDigits_Label)
      testDigits_Label
predRF    0    1    2    3    4    5    6    7    8    9
     0  967    0    6    1    1    7   11    2    5    5
     1    0 1123    0    0    0    1    3    7    0    5
     2    1    2  974    9    3    1    3   25    4    2
     3    0    3    5  963    0   21    0    0    9   10
     4    0    0   12    0  940    1    4    2    7   15
     5    4    0    2   16    0  832    6    0   11    4
     6    6    5    5    0    7   11  929    0    3    2
     7    1    1   14    7    2    2    0  979    4    6
     8    1    1   12    7    5   11    2    1  917   10
     9    0    0    2    7   24    5    0   12   14  950

  • Question 1: Can anyone explain why neuralnet behaves this strangely on this dataset? (BTW, neuralnet worked fine on the iris dataset when I checked.)

      • I think I understand the reason for the Gaussian-like distribution in the output when neuralnet is used. There is only one output node (or is it neuron?) instead of one node per output class (10 classes here). So, while calculating the delta for back-propagation, the algorithm computes the difference between the 'expected output' and the 'calculated output', which, aggregated over all the instances, is smallest for instances whose output is 4 or 5. The weights are therefore adjusted during back-propagation so that this output error is minimized. This 'might' be the reason for the Gaussian-like output given by neuralnet; a quick numeric check of this intuition is sketched below.
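
      With squared-error loss and a single numeric output, the constant prediction that minimizes the total error over balanced labels 0-9 is their mean, 4.5, which is exactly where the predictions pile up. A tiny hypothetical check (made-up balanced labels, not the MNIST data):

      # With balanced labels 0..9, the constant k minimizing sum((labels - k)^2)
      # is mean(labels) = 4.5, so an under-fit single-output regression
      # drifts toward predicting 4 and 5.
      labels <- rep(0:9, each = 100)
      sse <- sapply(0:9, function(k) sum((labels - k)^2))
      data.frame(constant = 0:9, sse = sse)  # SSE is smallest at k = 4 and k = 5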

      Question 2: I'd also like to know how to correct this behaviour of neuralnet and get predictions on par with the randomForest results.

      Answer

      Some preliminary advice: you can load your data a little more efficiently like this:

      # Read in data.
      trainDigits <- replicate(numberOfImages,c(matrix(readBin(to.read, integer(), n=(rowPixels*columnPixels), size=1, endian="big"),rowPixels,columnPixels)[,columnPixels:1]))
      trainDigits <- data.frame(t(trainDigits),row.names=NULL)
      trainDigits_Label<-replicate(numberOfImages,readBin(to.read_Label, integer(), n=1, size=1, endian="big"))
      

      Your first problem is that you have not specified a multiclass prediction to neuralnet. What you were doing was predicting a single real number from 0 to 9, which is why there was only one output instead of 10 class predictions.

      If you look at ?neuralnet, there is an example of a multiclass prediction; you must put each class in a separate variable and put it on the left-hand side of the formula. Other packages, like nnet, will detect a factor automatically and do this for you. You can use the class.ind function from the nnet package to split a factor into multiple indicator variables:

      # encode each label as its own indicator column and prepend to the training data
      library(nnet)  # provides class.ind
      output <- class.ind(trainDigits_Label)
      colnames(output) <- paste0('out.', colnames(output))
      output.names <- colnames(output)
      input.names <- colnames(trainDigits)
      trainDigits <- cbind(output, trainDigits)
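
      To see what class.ind produces, here is a tiny standalone illustration (made-up labels, not the MNIST data):

      # class.ind turns a vector of labels into a 0/1 indicator matrix,
      # one column per class:
      class.ind(c(0, 1, 2, 1))
      #      0 1 2
      # [1,] 1 0 0
      # [2,] 0 1 0
      # [3,] 0 0 1
      # [4,] 0 1 0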
      

      Now you can paste together a formula:

      # Considering only 500 rows
      trainsize=500
      # neuralnet:::varify.variables (sic) does not pass "data" when calling "terms".
      # If it did, you wouldn't have to construct the formula like this.
      library(neuralnet)
      myNnet <- neuralnet(formula = paste(paste(output.names,collapse='+'),'~',
                                    paste(input.names,collapse='+')),
                          data = trainDigits[1:trainsize,],
                          hidden = 10, 
                          algorithm='rprop+', 
                          learningrate=0.01,
                          rep=1)
      

      The correction still doesn't make the neural network perform well. To get an idea of how badly it is doing, look at its performance on the training data. It should be pretty good, because the network has seen all this data before:

      # Accuracy on training data
      res<-compute(myNnet,trainDigits[1:trainsize,input.names])
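      # net.result has one column per class; pick the digit (0-9) whose
      # output unit has the largest activation in each row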
      picks<-(0:9)[apply(res$net.result,1,which.max)]
      prop.table(table(trainDigits_Label[1:trainsize] == picks))
      # FALSE  TRUE 
      # 0.376 0.624 
      

      An accuracy of 62% is terrible on training data. As you might expect, it performs barely above chance (10% for ten balanced classes) on the rest of the data:

      # Accuracy on test data
      res<-compute(myNnet,trainDigits[(trainsize+1):60000,input.names])
      picks<-(0:9)[apply(res$net.result,1,which.max)]
      prop.table(table(trainDigits_Label[(trainsize+1):60000] == picks))
      # FALSE         TRUE 
      # 0.8612268908 0.1387731092 
      # 14% accuracy
      

      Random forest does amazingly well with the exact same data. There is a good reason why it has become so popular lately.

      trainsize=500
      library(randomForest)
      myRF <- randomForest(trainDigits_Label~.,
                           data=data.frame(trainDigits_Label=as.factor(trainDigits_Label),
                                           trainDigits[input.names])[1:trainsize,],
                           ntree=100)
      
      # Train
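      # Note: with no newdata, predict() returns out-of-bag predictions,
      # so this 'training' accuracy is effectively cross-validated.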
      p <- as.numeric(as.character(predict(myRF)))
      prop.table(table(trainDigits_Label[1:trainsize]==p))
      # Accuracy: 79%    
      
      # Test
      p <- as.numeric(as.character(predict(myRF,trainDigits[(trainsize+1):60000,])))
      prop.table(table(trainDigits_Label[(trainsize+1):60000]==p))
      # Accuracy: 76%
      

      So, for your second question, my counter-question is: why would you expect the neural network to do as well as the random forest? They may have some vague structural similarities, but the fitting processes are quite different. I suppose you could pore over the nodes in the neural network and compare them to the most important variables in the random forest model, but at this point it is more of a statistics question than a programming one.
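
      If you do want to attempt that comparison, here is a minimal sketch (assuming the myRF and myNnet objects fitted above; this inspection strategy is just one possible approach):

      # Contrast random forest variable importance with the magnitude of the
      # neuralnet input-to-hidden weights (both models fitted above).
      imp <- importance(myRF)                       # mean decrease in Gini, one row per input pixel
      head(imp[order(-imp[, 1]), , drop = FALSE])   # the most influential pixels

      # myNnet$weights[[1]][[1]] is the first-layer weight matrix:
      # (1 bias + 784 input) rows by 10 hidden-unit columns.
      w <- myNnet$weights[[1]][[1]]
      inputStrength <- rowSums(abs(w))[-1]          # total absolute weight per input (bias dropped)
      names(inputStrength) <- input.names
      head(sort(inputStrength, decreasing = TRUE))  # inputs the network leans on most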

