Data shuffling for Image Classification

Question

I want to develop a CNN model to identify 24 hand signs in American Sign Language. I created a custom dataset that contains 3000 images for each hand sign, i.e. 72000 images in the entire dataset.

For training the model, I would be using an 80-20 dataset split (2400 images/hand sign in the training set and 600 images/hand sign in the validation set).
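
A minimal sketch of such a split, assuming the images are already loaded into NumPy arrays; the random stand-in data, the variable names, and the use of scikit-learn's train_test_split are illustrative, not part of the original question:

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for the real dataset (random arrays so the
# sketch runs on its own; the real dataset has 3000 images per sign).
n_classes, n_per_class = 24, 100
X = np.random.rand(n_classes * n_per_class, 64, 64)  # fake "images"
y = np.repeat(np.arange(n_classes), n_per_class)     # class labels

# 80-20 split: shuffle=True (the default) randomizes the order first,
# and stratify=y keeps the 80/20 ratio within every hand sign.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42)
print(X_train.shape, X_val.shape)  # (1920, 64, 64) (480, 64, 64)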

My question is: should I randomly shuffle the images when creating the dataset, and why?

Based on my previous experience, it led to the validation loss being lower than the training loss, and the validation accuracy higher than the training accuracy. Check this link.

Answer

Random shuffling of data is a standard procedure in all machine learning pipelines, and image classification is no exception; its purpose is to break possible biases introduced during data preparation, e.g. putting all the cat images first and then all the dog ones in a cat/dog classification dataset.

Take for example the famous iris dataset:

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
y
# result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

As you can clearly see, the dataset has been prepared in such a way that the first 50 samples are all of label 0, the next 50 of label 1, and the last 50 of label 2. Try to perform a 5-fold cross-validation on such a dataset without shuffling and you'll find most of your folds containing only a single label; try a 3-fold CV, and all of your folds will include only one label. Bad... BTW, it's not just a theoretical possibility; it has actually happened.
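
A quick way to verify the 3-fold case, as a sketch using scikit-learn's KFold (which does not shuffle by default):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Without shuffling, each test fold is a contiguous block of 50
# samples, i.e. exactly one label.
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=3).split(X)):
    print(f"fold {i}: test labels = {np.unique(y[test_idx])}")
# fold 0: test labels = [0]
# fold 1: test labels = [1]
# fold 2: test labels = [2]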

Even if no such bias exists, shuffling never hurts, so we always do it just to be on the safe side (you never know...).
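
For completeness, a sketch of the safe version of the snippet above: passing shuffle=True (with a fixed random_state for reproducibility) mixes the labels across the folds:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=3, shuffle=True, random_state=42)
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {i}: test labels = {np.unique(y[test_idx])}")
# each fold now contains a mix of labels, typically all three: [0 1 2]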

Based on my previous experience, it led to the validation loss being lower than the training loss, and the validation accuracy higher than the training accuracy. Check this link.

As noted in the answer there, it is highly unlikely that this was due to shuffling. Data shuffling is nothing sophisticated; essentially, it is just the equivalent of shuffling a deck of cards. It may have happened that once you insisted on "better" shuffling and subsequently ended up with a straight flush, but obviously that was not due to the "better" shuffling of the cards.
