系统之间的随机种子是否兼容? [英] Are random seeds compatible between systems?

查看:29
本文介绍了系统之间的随机种子是否兼容?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用 python 的 sklearn 包制作了一个随机森林模型,其中我将种子设置为例如 1234.为了生产模型,我们使用 pyspark.如果我要传递相同的超参数和相同的种子值,即 1234,它会得到相同的结果吗?

I made a random forest model using python's sklearn package where I set the seed to for example to 1234. To productionise models, we use pyspark. If I was to pass the same hyperparmeters and same seed value, i.e. 1234, will it get the same results?

基本上,随机种子数在不同系统之间是否有效?

Basically, do random seed numbers work between different systems?

推荐答案

嗯,这正是可以真正用一些实验解决的问题.提供的代码片段...

Well, this is exactly the kind of question that could really do with some experiments & code snippets provided...

无论如何,似乎普遍的答案是肯定的:不仅在 Python 和 Spark MLlib 之间,甚至在 Spark 子模块之间,或者在 Python 和麻木...

Anyway, it seems that the general answer is a firm no: not only between Python and Spark MLlib, but even between Spark sub-modules, or between Python & Numpy...

这是一些可重现的代码,运行在 Databricks 社区云(其中 pyspark 已经导入并且相关上下文已经初始化):

Here is some reproducible code, run in the Databricks community cloud (where pyspark is already imported & the relevant contexts initialized):

import sys

import random
import pandas as pd
import numpy as np
from pyspark.sql.functions import rand, randn
from pyspark.mllib import random as r  # avoid conflict with native Python random module

print("Spark version " + spark.version)
print("Python version %s.%s.%s" % sys.version_info[:3])
print("Numpy version " + np.version.version)

# Spark version 2.3.1 
# Python version 3.5.2 
# Numpy version 1.11.1

s = 1234 # RNG seed


# Spark SQL random module:
spark_df = sqlContext.range(0, 10)
spark_df = spark_df.select("id", randn(seed=s).alias("normal"), rand(seed=s).alias("uniform"))


# Python 3 random module:
random.seed(s)
x = [random.uniform(0,1) for i in range(10)] # random.rand() gives exact same results

random.seed(s)
y = [random.normalvariate(0,1) for i in range(10)]

df = pd.DataFrame({'uniform':x, 'normal':y})


# numpy random module
np.random.seed(s)
xx = np.random.uniform(size=10)  # again, np.random.rand(10) gives exact same results

np.random.seed(s)
yy = np.random.randn(10)

numpy_df = pd.DataFrame({'uniform':xx, 'normal':yy})


# Spark MLlib random module
rdd_uniform = r.RandomRDDs.uniformRDD(sc, 10, seed=s).collect()
rdd_normal = r.RandomRDDs.normalRDD(sc, 10, seed=s).collect()

rdd_df = pd.DataFrame({'uniform':rdd_uniform, 'normal':rdd_normal})

这是结果:

原生 Python 3:

Native Python 3:

# df

     normal  uniform
0  1.430825 0.966454
1  1.803801 0.440733 
2  0.321290 0.007491 
3  0.599006 0.910976 
4 -0.700891 0.939269 
5  0.233350 0.582228
6 -0.613906 0.671563
7 -1.622382 0.083938
8  0.131975 0.766481
9  0.191054 0.236810

麻木:

# numpy_df

     normal  uniform
0  0.471435 0.191519
1 -1.190976 0.622109 
2  1.432707 0.437728
3 -0.312652 0.785359
4 -0.720589 0.779976
5  0.887163 0.272593
6  0.859588 0.276464 
7 -0.636524 0.801872 
8  0.015696 0.958139
9 -2.242685 0.875933

Spark SQL:

# spark_df.show()

+---+--------------------+-------------------+ 
| id|              normal|            uniform|
+---+--------------------+-------------------+
|  0|  0.9707422835368164| 0.9499610869333489| 
|  1|  0.3641589200870126| 0.9682554532421536|
|  2|-0.22282955491417034|0.20293463923130883|
|  3|-0.00607734375219...|0.49540111648680385|
|  4|  -0.603246393509015|0.04350782074761239|
|  5|-0.12066287904491797|0.09390549680302918|
|  6|  0.2899567922101867| 0.6789838400775526|
|  7|  0.5827830892516723| 0.6560703836291193|
|  8|   1.351649207673346| 0.7750229279150739|
|  9|  0.5286035772104091| 0.6075560897646175|
+---+--------------------+-------------------+

Spark MLlib:

Spark MLlib:

# rdd_df

     normal  uniform 
0 -0.957840 0.259282 
1  0.742598 0.674052 
2  0.225768 0.707127 
3  1.109644 0.850683 
4 -0.269745 0.414752 
5 -0.148916 0.494394 
6  0.172857 0.724337
7 -0.276485 0.252977
8 -0.963518 0.356758
9  1.366452 0.703145

当然,即使上述结果相同,也不能保证例如 scikit-learn 中的随机森林的结果与 pyspark 随机森林的结果完全相同...

Of course, even if the above results were identical, this would be no guarantee that results from, say, Random Forest in scikit-learn, would be exactly identical to the results of pyspark Random Forest...

尽管答案是否定的,但我真的看不出这会如何影响任何 ML 系统的部署,即如果结果至关重要依赖于 RNG,那么某些事情肯定是不对...

Despite the negative answer, I really cannot see how that affects the deployment of any ML system, i.e. if the results depend crucially on the RNG, then something is definitely not right...

这篇关于系统之间的随机种子是否兼容?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆