PySpark: Randomize rows in dataframe


Question

I have a dataframe and I want to randomize the rows in the dataframe. I tried sampling the data with a fraction of 1, which didn't work (interestingly, this works in Pandas).

Answer

It works in Pandas because taking a sample on a local system is typically implemented by shuffling the data. Spark, on the other hand, avoids shuffling by performing linear scans over the data. This means that sampling in Spark only randomizes which rows are members of the sample, not their order.
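
To see the difference, here is a minimal sketch comparing a fraction-1 sample in both libraries (assuming a pyspark shell where `spark` is available; `pdf` and `sdf` are illustrative names):

import pandas as pd

# Pandas: sample(frac=1) shuffles, so the row order changes
pdf = pd.DataFrame({"x": range(5)})
print(pdf.sample(frac=1)["x"].tolist())   ## e.g. [3, 0, 4, 1, 2]

# Spark: sampling scans the data linearly, so the rows that are
# kept appear in their original relative order
sdf = spark.createDataFrame([(i,) for i in range(5)], ["x"])
sdf.sample(False, 1.0).show()             ## order preserved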

You can order the DataFrame by a column of random numbers:

from pyspark.sql.functions import rand

# Build a single-column DataFrame (`sc` is the SparkContext that
# the pyspark shell provides)
df = sc.parallelize(range(20)).map(lambda x: (x, )).toDF(["x"])

# Sorting by a column of random numbers shuffles the rows
df.orderBy(rand()).show(3)

## +---+
## |  x|
## +---+
## |  2|
## |  7|
## | 14|
## +---+
## only showing top 3 rows
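
If the shuffled order needs to be reproducible, `rand` also accepts a seed; note that the result is only stable as long as the partitioning of the input data is unchanged:

# With a fixed seed the random column - and hence the ordering -
# is deterministic for a given input partitioning
df.orderBy(rand(seed=42)).show(3)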

However, this approach is:

  • expensive - because it requires a full shuffle, which is something you typically want to avoid (a cheaper, partition-local alternative is sketched below).
  • suspicious - because the order of values in a DataFrame is not something you can really depend on in non-trivial cases, and since a DataFrame doesn't support indexing, it is relatively useless without collecting.
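
If the cost of the full shuffle is the concern, one workaround (a sketch, not taken from the original answer; `shuffle_partition` is an illustrative name) is to shuffle rows within each partition only, so nothing moves across the network:

import random

def shuffle_partition(rows):
    # Materialize one partition, shuffle it locally, and hand it back;
    # rows never cross partition boundaries, so no shuffle is triggered
    rows = list(rows)
    random.shuffle(rows)
    return iter(rows)

shuffled = df.rdd.mapPartitions(shuffle_partition).toDF(["x"])
shuffled.show(3)

This only randomizes order within partitions, so it is a partial shuffle; a true global randomization still requires something like orderBy(rand()).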
