Machine learning algorithm score changes without any change in data or steps

Question

I am new to machine learning and am getting started with the Titanic problem on Kaggle. I have written a simple algorithm to predict the result on the test data.

My question is this: every time I run the algorithm with the same dataset and the same steps, the score value changes (last statement in the code). I cannot understand this behaviour.

Code:

# imports
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier

# load data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
results = pd.read_csv('gender_submission-orig.csv')

# prepare training and test dataset
y = train['Survived']
X = train.drop(['Survived', 'SibSp', 'Ticket', 'Cabin', 'Embarked', 'Name'], axis=1)
test = test.drop(['SibSp', 'Ticket', 'Cabin', 'Embarked', 'Name'], axis=1)
y_test = results['Survived']

X = pd.get_dummies(X)
test = pd.get_dummies(test)

# fill the missing values
age_median = X['Age'].median()
fare_median = X['Fare'].median()

X['Age'] = X['Age'].fillna(age_median)
# assign the result back instead of using inplace=True on a column selection
test['Age'] = test['Age'].fillna(age_median)
test['Fare'] = test['Fare'].fillna(fare_median)

# train the classifier and predict
clf = DecisionTreeClassifier()
clf.fit(X, y)
predict = clf.predict(test)

# This is the score that changes from run to run.
print(round(clf.score(test, y_test) * 100, 2))

Recommended answer

This is a common frustration for newcomers to the field. The cause is the inherent randomness in this kind of algorithm: scikit-learn's DecisionTreeClassifier randomly permutes the features at each split, so ties between equally good splits can be broken differently on each run. The simple and straightforward remedy, as already suggested in the comments, is to explicitly set the state (seed) of the random number generator, e.g.:

clf = DecisionTreeClassifier(random_state=42) 
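To check that a fixed seed really makes the score repeatable, something like the following minimal sketch could be used. It does not use the Titanic files; the data comes from `make_classification` and the helper `fit_and_score` is a hypothetical name, both assumptions made only so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

def fit_and_score(seed):
    """Fit a fresh tree with the given seed and return its test accuracy."""
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Two independent fits with the same seed should give the same score.
first = fit_and_score(42)
second = fit_and_score(42)
print(first, second, first == second)
```

Leaving `random_state` unset (the default, `None`) is what allows the score to differ between runs.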

But with different seed values, the score also changes. So how do we find the optimal or right value?

Again, this is expected, and it cannot be overcome: this kind of randomness is fundamental and cannot be removed. Setting the random seed as suggested above only ensures reproducibility of a specific model/script; finding an "optimal" value in the sense you mean here (i.e. regarding the random parts) is not possible. Statistically speaking, the results produced by different values of the random seed should be similar, but quantifying this similarity exactly is an exercise in rigorous statistics that goes well beyond the scope of this post.
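Without attempting the rigorous treatment, one informal way to see how similar the results are is to fit the same model under several seeds and look at the spread of the scores. The sketch below again uses synthetic data from `make_classification` as a stand-in for the Titanic set, and the helper name `score_with_seed` and the choice of 20 seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data (an assumption, not the asker's dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

def score_with_seed(seed):
    """Fit a tree with the given seed and return its test accuracy."""
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Same data, same steps, only the seed differs.
scores = np.array([score_with_seed(seed) for seed in range(20)])
print(f"mean={scores.mean():.3f}  std={scores.std():.3f}  "
      f"min={scores.min():.3f}  max={scores.max():.3f}")
```

Reporting the mean and spread over several seeds is usually more informative than the score for any single seed, and picking the seed with the highest test score would only be tuning to noise.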

Randomness is often a non-intuitive realm, and random number generators (RNGs) themselves are strange animals. As a general note, you might be interested to know that RNGs are not even "compatible" across different languages and frameworks.
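A small illustration of that incompatibility, kept within Python for convenience: the standard-library `random` module, NumPy's legacy `RandomState`, and NumPy's newer `default_rng` generator are each seeded with the same integer, yet they produce different sequences. The seed value 42 is arbitrary, and this is only a sketch of the general point, not a comparison across languages.

```python
import random
import numpy as np

seed = 42

# Python standard library generator.
random.seed(seed)
py_values = [random.random() for _ in range(3)]

# NumPy legacy generator, seeded with the same integer.
np_legacy_values = np.random.RandomState(seed).random_sample(3).tolist()

# NumPy's newer Generator API, again with the same seed.
np_new_values = np.random.default_rng(seed).random(3).tolist()

print("random       :", py_values)
print("RandomState  :", np_legacy_values)
print("default_rng  :", np_new_values)

# Each generator is still reproducible on its own when re-seeded.
random.seed(seed)
py_again = [random.random() for _ in range(3)]
print("random repeat:", py_again == py_values)
```

So a seed only pins down results for one specific generator; the same number carries no meaning when moved to another library.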
