Data shuffling for Image Classification


Question

I want to develop a CNN model to identify 24 hand signs in American Sign Language. I created a custom dataset that contains 3000 images for each hand sign i.e. 72000 images in the entire dataset.

For training the model, I would be using 80-20 dataset split (2400 images/hand sign in the training set and 600 images/hand sign in the validation set).
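The split described above can be sketched with scikit-learn's `train_test_split`; the arrays here are placeholders standing in for the actual image data, and `stratify=y` is an assumption on my part to keep the 2400/600 per-sign ratio exact in both sets:

```python
# Sketch of the 80-20 split: 24 classes x 3000 images = 72000 samples.
# X is a stand-in for the real image tensors.
import numpy as np
from sklearn.model_selection import train_test_split

y = np.repeat(np.arange(24), 3000)    # 72000 labels, 3000 per hand sign
X = np.arange(len(y)).reshape(-1, 1)  # placeholder "images"

# stratify=y gives exactly 2400/600 images per sign in train/validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(len(X_train), len(X_val))  # 57600 14400
```

(`train_test_split` also shuffles by default, which already addresses the question below.)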

My question is: Should I randomly shuffle the images when creating the dataset? And Why?

Based on my previous experience, it led to validation loss being lower than training loss and validation accuracy more than training accuracy. Check this link.

Answer

Random shuffling of data is a standard procedure in all machine learning pipelines, and image classification is not an exception; its purpose is to break possible biases during data preparation - e.g. putting all the cat images first and then the dog ones in a cat/dog classification dataset.

Take for example the famous iris dataset:

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
y
# result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

As you can clearly see, the dataset has been prepared in such a way that the first 50 samples are all of label 0, the next 50 of label 1, and the last 50 of label 2. Try to perform a 5-fold cross validation in such a dataset without shuffling and you'll find most of your folds containing only a single label; try a 3-fold CV, and all your folds will include only one label. Bad... BTW, it's not just a theoretical possibility, it has actually happened.
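The claim above is easy to verify: with `KFold` and its default `shuffle=False`, a 3-fold split of the unshuffled iris labels puts exactly one class in each test fold.

```python
# 3-fold CV on the ordered iris labels, no shuffling:
# each consecutive block of 50 samples is a single class.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
for _, test_idx in KFold(n_splits=3, shuffle=False).split(X):
    print(np.unique(y[test_idx]))  # [0], then [1], then [2]
```

Any model trained on such folds is asked to predict a class it has never seen, so CV scores collapse; passing `shuffle=True` (or using `StratifiedKFold`) fixes this.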

Even if no such bias exists, shuffling never hurts, so we do it always just to be on the safe side (you never know...).
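In practice this is a one-liner; a minimal sketch using scikit-learn's `shuffle` utility on the same iris data:

```python
# Shuffle features and labels together (same permutation for both).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import shuffle

X, y = load_iris(return_X_y=True)
X_s, y_s = shuffle(X, y, random_state=42)
print(y_s[:10])  # labels now interleaved instead of 50+50+50 blocks
```

Fixing `random_state` keeps the shuffle reproducible across runs, which matters when you want comparable experiments.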

"Based on my previous experience, it led to validation loss being lower than training loss and validation accuracy more than training accuracy. Check this link."

As noted in the answer there, it is highly unlikely that this was due to shuffling. Data shuffling is not anything sophisticated - essentially, it is just the equivalent of shuffling a deck of cards; it may have happened that you once insisted on a "better" shuffle and subsequently ended up with a straight flush, but obviously that was not caused by the "better" shuffling of the cards.

