pyspark - get consistent random value across Spark sessions

Question

I want to add a column of random values to a dataframe (which has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions, i.e. the same random value for each row id. I am able to reproduce the results by using

from pyspark.sql.functions import rand

new_df = my_df.withColumn("rand_index", rand(seed=7))

but it only works within the same Spark session. I do not get the same results once I relaunch Spark and rerun my script.

I also tried defining a UDF to test whether I could generate random integers within an interval, using Python's random module with random.seed set:

import random
from pyspark.sql.types import LongType

random.seed(7)
spark.udf.register("getRandVals", lambda x, y: random.randint(x, y), LongType())

but to no avail.
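
For reference, this is roughly how I invoke the registered UDF from the DataFrame API (the interval bounds 0 and 100 are just placeholders):

from pyspark.sql.functions import expr

# getRandVals is registered with the SQL engine, so it is called via expr();
# 0 and 100 are placeholder interval bounds
new_df = my_df.withColumn("rand_index", expr("getRandVals(0, 100)"))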

Is there a way to ensure reproducible random number generation across Spark sessions, such that a row id gets the same random value? I would really appreciate some guidance :) Thanks for the help!

Answer

I suspect that you are getting the same common values for the seed, but in a different order based on your partitioning, which is influenced by the data distribution when reading from disk, and there may be more or less data from one run to the next. But I am not privy to your actual code.

The rand function generates the same random data for a given seed (what would be the point of the seed otherwise?), and somehow the partitions each get a slice of it. If you look at the output below, you should be able to guess the pattern!
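
One quick way to see that dependence is to force a different partition layout onto the same rows; which row gets which value should then shift relative to the first table below. A small sketch along those lines:

from pyspark.sql.functions import col, rand

# same rows, same seed, but a single partition: the id-to-value
# assignment changes because it depends on the partition layout
df = spark.range(1, 5).select(col("id").cast("double")).coalesce(1)
df.withColumn("rand_index", rand(seed=7)).show()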

Here is an example with two dataframes of different cardinality. You can see that the seed gives the same results or a superset of them. So, ordering and partitioning play a role imo.

from pyspark.sql.functions import col, rand

# 4-row dataframe with a seeded random column
df1 = spark.range(1, 5).select(col("id").cast("double"))
df1 = df1.withColumn("rand_index", rand(seed=7))
df1.show()

# inspect how the rows are spread across partitions
print(str(df1.rdd.getNumPartitions()) + ' partitions')
print('Partitioning distribution: ' + str(df1.rdd.glom().map(len).collect()))

Returns:

+---+-------------------+
| id|         rand_index|
+---+-------------------+
|1.0|0.06498948189958098|
|2.0|0.41371264720975787|
|3.0|0.12030715258495939|
|4.0| 0.2731073068483362|
+---+-------------------+

8 partitions & Partitioning distribution: [0, 1, 0, 1, 0, 1, 0, 1]

The same again, but with more data:

...
df1 = spark.range(1, 10).select(col("id").cast("double"))
...

Returns:

+---+-------------------+
| id|         rand_index|
+---+-------------------+
|1.0| 0.9147159860432812|
|2.0|0.06498948189958098|
|3.0| 0.7069655052310547|
|4.0|0.41371264720975787|
|5.0| 0.1982919638208397|
|6.0|0.12030715258495939|
|7.0|0.44292918521277047|
|8.0| 0.2731073068483362|
|9.0| 0.7784518091224375|
+---+-------------------+

8 partitions & Partitioning distribution: [1, 1, 1, 1, 1, 1, 1, 2]

You can see the 4 common random values, both within a Spark session and across sessions.
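
If the goal is a value that survives both repartitioning and session restarts, one option is to derive it from the row id itself instead of from rand. A minimal sketch, assuming the dataframe has a numeric id column as in the question:

import random

from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# each row seeds its own generator from its id, so the result depends
# only on the id, not on partitioning, ordering, or the Spark session
@udf(DoubleType())
def rand_from_id(row_id):
    return random.Random(row_id).random()

new_df = my_df.withColumn("rand_index", rand_from_id(col("id")))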
