100% classifier accuracy when shuffling data rows


Question

I'm working on the mushroom classification data set (found here: https://www.kaggle.com/uciml/mushroom-classification).

I've done some pre-processing on the data (removed redundant attributes, converted categorical data to numerical) and I'm trying to use it to train classifiers.

Whenever I shuffle my data, either manually or by using train_test_split, all of the models I use (XGB, MLP, LinearSVC, Decision Tree) reach 100% accuracy. Whenever I test the models on unshuffled data, the accuracy is around 50-85%.

Here are my methods for splitting the data:

from sklearn.model_selection import train_test_split

x = testing.copy()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, shuffle=True)

and manually:

x = testing.copy()
x = x.sample(frac=1)

testRatio = 0.3
testCount = int(len(x)*testRatio)

x_train = x[testCount:]
x_test = x[0:testCount]
y_train = y[testCount:]
y_test = y[0:testCount]
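
Note on the manual split: only x is reshuffled, while y is sliced in its original order. Assuming x is a pandas DataFrame and y is a pandas Series sharing the same index (neither is shown in the question), the feature rows and labels may then no longer correspond. A minimal sketch of one way to keep them paired is to pick the labels by the shuffled index. The helper name `shuffled_split` and the toy data are hypothetical, chosen only for illustration.

```python
import pandas as pd

def shuffled_split(x, y, test_ratio=0.3, seed=0):
    """Shuffle the rows of x, split them, and select labels from y by index
    so that every feature row keeps its own label."""
    shuffled = x.sample(frac=1, random_state=seed)  # keeps original index labels
    test_count = int(len(shuffled) * test_ratio)
    x_test = shuffled.iloc[:test_count]
    x_train = shuffled.iloc[test_count:]
    # Look labels up by index label rather than by position
    y_test = y.loc[x_test.index]
    y_train = y.loc[x_train.index]
    return x_train, x_test, y_train, y_test

# Toy data (an assumption, not the mushroom data): label is the parity of the feature
x = pd.DataFrame({"feature": range(10)})
y = pd.Series([v % 2 for v in range(10)], name="class")

x_train, x_test, y_train, y_test = shuffled_split(x, y)
print(x_test.assign(label=y_test))
```

Because the labels are looked up by index, the pairing survives any row order produced by sample.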

Is there something I'm doing completely wrong, or something I'm missing?

Edit: The only difference I can see between splitting the data with and without shuffling the rows is the distribution of the classes.

Without shuffling:

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, shuffle=False)

print(y_test.value_counts())
print(y_train.value_counts())

Result:

0    1828
1     610
Name: class, dtype: int64
1    3598
0    2088
Name: class, dtype: int64

With shuffling:

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.3, shuffle=True)

print(y_test.value_counts())
print(y_train.value_counts())

Result:

0    1238
1    1200
Name: class, dtype: int64
1    3008
0    2678
Name: class, dtype: int64

I don't see how this would affect the model's accuracy so strongly, though.
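
One way to take the class distribution out of the comparison is to keep it the same in both parts of the split. The sketch below is only an illustration on made-up, class-sorted labels (the toy arrays and variable names are assumptions, not the mushroom data): it contrasts an unshuffled split with a shuffled split that uses the `stratify` argument of train_test_split.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows sorted by class
x = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)

# Unshuffled: the test set is simply the last 30 rows of the sorted data
_, _, _, y_test_ordered = train_test_split(x, y, test_size=0.3, shuffle=False)

# Shuffled and stratified: both parts keep the overall 60/40 class ratio
_, _, _, y_test_strat = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0
)

ordered_counts = Counter(y_test_ordered.tolist())
strat_counts = Counter(y_test_strat.tolist())
print("unshuffled test classes:", dict(ordered_counts))
print("stratified test classes:", dict(strat_counts))
```

When the rows are ordered, an unshuffled test set can contain classes or feature combinations in proportions the training set barely covers, which is one plausible reason for the lower scores.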

Edit 2: Following PV8's advice, I tried verifying my results with cross-validation, and it seems to do the trick. I'm getting much more reasonable results this way.

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

model = LinearSVC()
scores = cross_val_score(model, x, y, cv=5)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Output:

[1.         1.         1.         1.         0.75246305]
Accuracy: 0.95 (+/- 0.20)

Recommended answer

This can be normal behavior. How many shuffles did you try?

This indicates that your results fluctuate considerably depending on how you split the data. I hope you measured the test accuracy and not the training accuracy?
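
To see how much the score depends on a particular shuffle, the split can be repeated with several seeds while reporting training and test accuracy separately. This is a rough sketch on synthetic data from make_classification (the data, the seeds, and the choice of DecisionTreeClassifier are assumptions for illustration, not the asker's setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels
x, y = make_classification(n_samples=300, n_features=8, random_state=0)

results = []
for seed in range(5):
    # A different random_state gives a different shuffle each time
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.3, shuffle=True, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
    train_acc = model.score(x_train, y_train)  # accuracy on data the model has seen
    test_acc = model.score(x_test, y_test)     # accuracy on held-out data
    results.append((seed, train_acc, test_acc))
    print(f"seed={seed} train={train_acc:.3f} test={test_acc:.3f}")
```

If the test accuracy stays stable across seeds, a single high score is less likely to be an accident of one split.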

I would suggest using cross-validation; it will help you verify your overall results.
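
As a hedged illustration of that suggestion: when cv is an integer and the estimator is a classifier, cross_val_score uses StratifiedKFold without shuffling, so ordered rows can still produce one unusual fold. Passing an explicitly shuffled splitter is one option. The sketch below runs on synthetic data (the data set and the DecisionTreeClassifier are assumptions, not the asker's setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels
x, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Shuffle rows before forming folds, keeping class ratios equal across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(model, x, y, cv=cv)
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

Comparing these scores with the unshuffled cv=5 scores shows how much of the variation comes from row order rather than from the model.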
