pyspark - get consistent random value across Spark sessions

Question

I want to add a column of random values to a dataframe (which has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions - the same random value against each row id. I am able to reproduce the results by using

from pyspark.sql.functions import rand

new_df = my_df.withColumn("rand_index", rand(seed=7))

but it only works when I am running it in the same Spark session. I do not get the same results once I relaunch Spark and run my script.

I also tried defining a udf, testing whether I can generate random values (integers) within an interval, using Python's random module with random.seed set:

import random
from pyspark.sql.types import LongType  # needed for the UDF return type

random.seed(7)
spark.udf.register("getRandVals", lambda x, y: random.randint(x, y), LongType())

But to no avail.
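What I suspect I actually need is for the value to be a pure function of the row id, so that executor, partition, and row order cannot matter - something along these lines (an untested sketch; the name rand_for_id and the bounds are just illustrative):

import random
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import LongType

# Seed a fresh generator from the row id, so the result depends only on
# the id and the bounds, not on executor, partition, or row order.
def rand_for_id(row_id, low, high):
    return random.Random(row_id).randint(low, high)

rand_for_id_udf = udf(rand_for_id, LongType())
# e.g. my_df.withColumn("rand_index", rand_for_id_udf(col("id"), lit(0), lit(100)))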

Is there a way to ensure reproducible random number generation across Spark sessions, such that a row id gets the same random value? I would really appreciate some guidance :) Thanks for the help!

Answer

I suspect that you are getting the same underlying values for a given seed, but in a different order, depending on your partitioning - which is influenced by the data distribution when reading from disk and by how much data there is each time. But I am not privy to your actual code.

The rand function generates the same random data for a given seed (otherwise what would be the point of the seed?), and each partition gets a slice of it. If you look closely, you should be able to spot the pattern!
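One way to see this (my own sketch, based on my understanding that Spark seeds each partition's generator from the seed plus the partition index): if you pin the data to a single partition with a deterministic row order, rand(seed=7) becomes reproducible across sessions:

from pyspark.sql.functions import col, rand

# Sketch: one partition + deterministic row order => reproducible rand(seed=7).
# (Assumes you can afford to collapse the data to a single partition.)
df_fixed = (spark.range(1, 5)
    .select(col("id").cast("double"))
    .coalesce(1)                   # a single partition
    .sortWithinPartitions("id")    # deterministic row order within it
    .withColumn("rand_index", rand(seed=7)))
df_fixed.show()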

Here is an example of 2 dataframes of different cardinality. You can see that the seed gives the same values or a superset of them. So, ordering and partitioning play a role, imo.

from pyspark.sql.functions import col, rand

df1 = spark.range(1, 5).select(col("id").cast("double"))
df1 = df1.withColumn("rand_index", rand(seed=7))
df1.show()

print(str(df1.rdd.getNumPartitions()) + ' partitions')
print('Partitioning distribution: ' + str(df1.rdd.glom().map(len).collect()))

Returns:

+---+-------------------+
| id|         rand_index|
+---+-------------------+
|1.0|0.06498948189958098|
|2.0|0.41371264720975787|
|3.0|0.12030715258495939|
|4.0| 0.2731073068483362|
+---+-------------------+

8 partitions & Partitioning distribution: [0, 1, 0, 1, 0, 1, 0, 1]

The same, with more data:

...
df1 = spark.range(1, 10).select(col("id").cast("double"))
...

Returns:

+---+-------------------+
| id|         rand_index|
+---+-------------------+
|1.0| 0.9147159860432812|
|2.0|0.06498948189958098|
|3.0| 0.7069655052310547|
|4.0|0.41371264720975787|
|5.0| 0.1982919638208397|
|6.0|0.12030715258495939|
|7.0|0.44292918521277047|
|8.0| 0.2731073068483362|
|9.0| 0.7784518091224375|
+---+-------------------+

8 partitions & Partitioning distribution: [1, 1, 1, 1, 1, 1, 1, 2]

You can see the 4 common random values - both within a Spark session and across sessions.
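If the goal is a value that is stable per row id no matter how the data is partitioned, one partition-independent approach (my own sketch, not part of the original answer; assumes Spark 3.0+ for xxhash64) is to derive the value from a hash of the id:

from pyspark.sql.functions import abs as abs_, col, lit, xxhash64

SEED = 7
M = 2 ** 32  # arbitrary modulus; larger values give finer granularity

# Deterministic in (id, SEED) only: the same id always maps to the same
# value in [0, 1), in any session and under any partitioning.
df_stable = df1.withColumn(
    "rand_index", (abs_(xxhash64(col("id"), lit(SEED))) % M) / M
)
df_stable.show()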
