Random_state's contribution to accuracy


Problem Description


Okay, this is interesting. I executed the same code a couple of times, and each time I got a different accuracy_score. I figured out that I was not using any random_state value when doing the train_test split, so I used random_state=0 and got a consistent accuracy_score of 82%. But then I thought I'd give it a try with a different random_state number: I set random_state=128 and the accuracy_score became 84%. Now I need to understand why that is and how random_state affects the accuracy of the model. The outputs are below:

1> without random_state:

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[90 22]
 [21 46]]
0.7597765363128491

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[104  16]
 [ 14  45]]
0.8324022346368715

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[90 18]
 [12 59]]
0.8324022346368715

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[99  9]
 [19 52]]
0.8435754189944135

2> with random_state = 128 (Accuracy_score = 84%)

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[106  13]
 [ 15  45]]
0.8435754189944135

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[106  13]
 [ 15  45]]
0.8435754189944135

3> with random_state = 0 (Accuracy_score = 82%)

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[93 17]
 [15 54]]
0.8212290502793296

runfile('C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic/Colab File.py', wdir='C:/Users/spark/OneDrive/Documents/Machine Learing/Datasets/Titanic')

: boolean
    use_inf_as_null had been deprecated and will be removed in a future
    version. Use `use_inf_as_na` instead.

[[93 17]
 [15 54]]
0.8212290502793296

Solution

Essentially, random_state ensures that your code produces the same results each time by performing exactly the same data splits each time. This is mostly helpful for your initial train/test split, and for creating code that others can replicate exactly.

Splitting the data the same vs. differently

The first thing to understand is that if you don't use random_state, then the data will be split differently each time, which means that your training and test sets will be different. This might not make a huge difference, but it will result in slight variations in your model parameters, accuracy, and so on. If you do set random_state to the same value each time, like random_state=0, then the data will be split the same way each time.
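For a concrete illustration, here is a minimal sketch of both behaviors (using a toy array in place of the asker's Titanic data): without random_state, each call reshuffles the rows; with a fixed random_state, the split is reproduced exactly.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix (10 rows)
y = np.arange(10)                 # toy labels

# No random_state: a fresh shuffle on every call, so the splits differ.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.3)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.3)
print(np.array_equal(X_test_a, X_test_b))  # almost always False

# Fixed random_state: exactly the same split on every call.
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
_, X_test_d, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
print(np.array_equal(X_test_c, X_test_d))  # always True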

Each random_state results in a different split

The second thing to understand is that each random_state value will result in a different split and different behavior. So if you want to be able to replicate results, you need to keep random_state at the same value.
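A quick sketch of that point on the same kind of toy data: two different seeds each give a reproducible split, but not the same split as each other, which is why the asker sees 82% with one seed and 84% with another.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Each seed is reproducible on its own, but the two seeds put different
# rows into the test set, so downstream scores can differ too.
_, X_test_0, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
_, X_test_128, _, _ = train_test_split(X, y, test_size=0.3, random_state=128)
print(np.array_equal(X_test_0, X_test_128))  # almost certainly False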

Your model can have multiple random_state pieces

The third thing to understand is that multiple pieces of your model might have randomness in them. For example, your train_test_split can accept random_state, but so can RandomForestClassifier. So in order to get the exact same results each time, you'll need to set random_state for each piece of your model that has randomness in it.
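Here is a sketch of what pinning every source of randomness looks like (make_classification stands in for the asker's Titanic features; it is not from the original question):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data; the Titanic features and labels would go here instead.
X, y = make_classification(n_samples=500, random_state=42)

# Randomness source 1: the train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Randomness source 2: the model itself (a random forest bootstraps rows
# and subsamples features, so it has its own random_state).
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# With both seeds pinned, this prints the identical score on every run.
print(accuracy_score(y_test, clf.predict(X_test)))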

Conclusions

If you're using random_state to do your initial train/test split, you're going to want to set it once and use that split going forward to avoid overfitting to your test set.

Generally speaking, you can use cross-validation to assess the accuracy of your model and not worry too much about the random_state.
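A minimal sketch of that cross-validation approach (again on stand-in data): averaging over several folds gives an accuracy estimate that doesn't hinge on any single lucky or unlucky split.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

# 5-fold cross-validation: each fold takes a turn as the held-out test
# set, so the reported accuracy is a mean over five different splits.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())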

A very important note is that you should not use random_state to try to improve the accuracy of your model. This is by definition going to result in your model overfitting your data, and not generalizing as well to unseen data.

