Random_state's contribution to accuracy
Question
Okay, this is interesting.
I executed the same code a couple of times, and each time I got a different accuracy_score.
I figured out that I was not passing any random_state value to train_test_split, so I used random_state=0 and got a consistent accuracy_score of 82%. But then I thought I'd try a different random_state number: with random_state=128, the accuracy_score became 84%.
Now I need to understand why that is, and how random_state affects the accuracy of the model.
Outputs are as below:
1> without random_state:
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[90 22]
[21 46]]
0.7597765363128491
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[104 16]
[ 14 45]]
0.8324022346368715
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[90 18]
[12 59]]
0.8324022346368715
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[99 9]
[19 52]]
0.8435754189944135
2> with random_state = 128 (Accuracy_score = 84%)
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[106 13]
[ 15 45]]
0.8435754189944135
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[106 13]
[ 15 45]]
0.8435754189944135
3> with random_state = 0 (Accuracy_score = 82%)
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[93 17]
[15 54]]
0.8212290502793296
runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')
: boolean
use_inf_as_null had been deprecated and will be removed in a future
version. Use `use_inf_as_na` instead.
[[93 17]
[15 54]]
0.8212290502793296
Essentially, random_state makes sure your code outputs the same results each time, by doing the exact same data splits each time. This is mostly helpful for your initial train/test split, and for creating code that others can replicate exactly.
Splitting the data the same vs. differently
The first thing to understand is that if you don't use random_state, then the data will be split differently each time, which means that your training and test sets will be different. This might not make a huge difference, but it will result in slight variations in your model parameters, accuracy, etc. If you do set random_state
to the same value each time, like random_state=0
, then the data will be split the same way each time.
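You can see this with a tiny toy dataset (the arrays below are made up for illustration, since the original Titanic data isn't shown): calling train_test_split twice with the same random_state reproduces the split exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix, 10 rows
y = np.arange(10)                 # labels double as row ids

# Without random_state, two calls will generally produce different splits.
X_a, _, _, _ = train_test_split(X, y, test_size=0.3)
X_b, _, _, _ = train_test_split(X, y, test_size=0.3)

# With the same random_state, the split is identical every time.
X_1, X_t1, y_1, y_t1 = train_test_split(X, y, test_size=0.3, random_state=0)
X_2, X_t2, y_2, y_t2 = train_test_split(X, y, test_size=0.3, random_state=0)
assert np.array_equal(X_1, X_2)
assert np.array_equal(y_t1, y_t2)
```

Note that the first two (unseeded) splits could coincide by chance on a dataset this small, which is why the check is only on the seeded pair.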
Each random_state results in a different split
The second thing to understand is that each random_state
value will result in a different split and different behavior. So if you want to be able to replicate results, you need to keep random_state
set to the same value.
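A quick sketch of this point: two different seeds pick different test rows, while repeating a seed reproduces its own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.arange(100)  # 100 dummy row ids

_, test0 = train_test_split(y, test_size=0.2, random_state=0)
_, test128 = train_test_split(y, test_size=0.2, random_state=128)

# Different seeds select (virtually always) different test rows.
assert not np.array_equal(test0, test128)

# The same seed always reproduces its own split.
_, test0_again = train_test_split(y, test_size=0.2, random_state=0)
assert np.array_equal(test0, test0_again)
```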
Your model can have multiple random_state pieces
The third thing to understand is that multiple pieces of your model might have randomness in them. For example, your train_test_split
can accept random_state
, but so can RandomForestClassifier
. So in order to get the exact same results each time, you'll need to set random_state
for each piece of your model that has randomness in it.
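As a minimal sketch of a fully reproducible pipeline (using a synthetic dataset from make_classification as a stand-in for the Titanic data), both the split and the forest get a seed:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in data; the seed here only fixes the generated dataset.
X, y = make_classification(n_samples=300, random_state=42)

def run(split_seed, model_seed):
    # Seed 1: controls which rows land in train vs. test.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=split_seed)
    # Seed 2: controls the forest's internal randomness (bootstrap
    # samples, feature subsets at each split).
    clf = RandomForestClassifier(n_estimators=50, random_state=model_seed)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# With both seeds fixed, the whole pipeline is deterministic.
assert run(0, 0) == run(0, 0)
```

Leaving either seed unset reintroduces run-to-run variation, even if the other is fixed.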
Conclusions
If you're using random_state
to do your initial train/test split, you're going to want to set it once and use that split going forward to avoid overfitting to your test set.
Generally speaking, you can use cross-validation to assess the accuracy of your model and not worry too much about the random_state
.
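For example, a cross-validated estimate averages over several splits instead of depending on one lucky or unlucky split (again on a synthetic stand-in dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validation: five train/test splits, five scores.
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

The mean score is a more stable estimate of model quality than any single train/test split's accuracy.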
A very important note is that you should not use random_state
to try to improve the accuracy of your model. This is by definition going to result in your model overfitting your data, and not generalizing as well to unseen data.
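To see why seed-picking is cheating, you can sweep the split seed and watch the accuracy spread (a sketch with a simple logistic regression on synthetic data; the exact numbers depend on the dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, random_state=42)

scores = []
for seed in range(20):  # same model, 20 different train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(accuracy_score(y_te, clf.predict(X_te)))

# The spread below comes purely from which rows landed in the test set.
# Reporting only the best seed describes the luckiest split, not a
# better model.
print(f"min={min(scores):.2f} max={max(scores):.2f}")
```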