Cross Validation in Weka


Question

From what I have read, I have always thought that cross-validation is performed like this:

In k-fold cross-validation, the original sample is randomly partitioned into k subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds then can be averaged (or otherwise combined) to produce a single estimation.

So k models are built, and the final one is the average of those. The Weka guide, however, says that each model is always built using ALL of the data set. So how does cross-validation in Weka work? Is the model built from all the data, with "cross-validation" meaning that k folds are created, the model is evaluated on each fold, and the final output is simply the average of the per-fold results?

Recommended answer

So, here is the scenario again: you have 100 labeled instances.

Using the training set

  • Weka takes the 100 labeled instances.
  • It applies an algorithm to build a classifier from these 100 instances.
  • It applies that classifier AGAIN to the same 100 instances.
  • It reports the performance of the classifier (applied to the same 100 instances from which it was developed).
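
The training-set scenario above can be sketched in a few lines of plain Python. This is only an illustration, not Weka's code: the toy data generator `make_data`, the 1-nearest-neighbour classifier `predict_1nn`, and the `accuracy` helper are all hypothetical names chosen for this sketch. It shows why a score measured on the training data itself tends to be optimistic.

```python
import random

def make_data(n=100, seed=0):
    """Hypothetical toy data: one numeric feature x and a noisy 0/1 label."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x + rng.gauss(0, 0.15) > 0.5 else 0
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training instance whose feature is closest to x."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test instances that the 1-NN classifier labels correctly."""
    correct = sum(1 for x, y in test if predict_1nn(train, x) == y)
    return correct / len(test)

data = make_data()                         # the 100 labeled instances
train_set_accuracy = accuracy(data, data)  # build on all 100, test on the same 100
print(train_set_accuracy)
```

Because every instance is its own nearest neighbour, this toy classifier scores perfectly on the data it was built from, even though the labels are noisy. That score says little about how it would do on unseen data.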

Using 10-fold CV

  • Weka takes the 100 labeled instances.
  • It splits them into 10 equal-sized folds of 10 instances each and forms 10 train/test sets: in each set, one fold (10 instances) is held out for testing and the other nine folds (90 instances) are used for training.
  • It builds a classifier with the algorithm from the 90 training instances of set 1 and applies it to that set's 10 test instances.
  • It does the same for sets 2 to 10, producing 9 more classifiers.
  • It averages the performance of the 10 classifiers produced from the 10 equal-sized (90 training, 10 testing) sets.
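
The 10-fold procedure above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Weka's implementation: `make_data`, `predict_1nn`, and `k_fold_cv` are hypothetical names, the data is a made-up toy set, and the split is a plain random shuffle.

```python
import random

def make_data(n=100, seed=0):
    """Hypothetical toy data: one numeric feature x and a noisy 0/1 label."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x + rng.gauss(0, 0.15) > 0.5 else 0
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Return the label of the training instance whose feature is closest to x."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

def k_fold_cv(data, k=10, seed=1):
    """Return the k folds and the accuracy measured on each held-out fold."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    # Every k-th instance goes to the same fold, giving k disjoint folds.
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]  # 10 instances held out for testing
        # The remaining k - 1 folds (90 instances) form the training data.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        correct = sum(1 for x, y in test if predict_1nn(train, x) == y)
        scores.append(correct / len(test))
    return folds, scores

data = make_data()
folds, scores = k_fold_cv(data)
cv_accuracy = sum(scores) / len(scores)  # average over the 10 classifiers
print(scores)
print(cv_accuracy)
```

Each instance is used for testing exactly once and is never part of the training data of the classifier that predicts it, which is what makes the averaged score a fairer estimate than the training-set score.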

Let me know if that answers your question.
