Stratified sampling with PySpark
Question
I have a Spark DataFrame with one column that contains lots of zeros and very few ones (only 0.01% are ones).
I'd like to take a random subsample, but a stratified one, so that it keeps the ratio of 1s to 0s in that column.
Is it possible to do this in PySpark?
I am looking for a non-Scala solution, based on DataFrames rather than RDDs.
Answer
The solution I suggested in Stratified sampling in Spark is pretty straightforward to convert from Scala to Python (or even to Java; see What's the easiest way to stratify a Spark Dataset?).
Nevertheless, I'll rewrite it in Python. Let's start by creating a toy DataFrame:
from pyspark.sql.functions import lit

# Use `data` rather than `list` to avoid shadowing the built-in.
data = [(2147481832, 23355149, 1), (2147481832, 973010692, 1),
        (2147481832, 2134870842, 1), (2147481832, 541023347, 1),
        (2147481832, 1682206630, 1), (2147481832, 1138211459, 1),
        (2147481832, 852202566, 1), (2147481832, 201375938, 1),
        (2147481832, 486538879, 1), (2147481832, 919187908, 1),
        (214748183, 919187908, 1), (214748183, 91187908, 1)]
df = spark.createDataFrame(data, ["x1", "x2", "x3"])
df.show()
# +----------+----------+---+
# | x1| x2| x3|
# +----------+----------+---+
# |2147481832| 23355149| 1|
# |2147481832| 973010692| 1|
# |2147481832|2134870842| 1|
# |2147481832| 541023347| 1|
# |2147481832|1682206630| 1|
# |2147481832|1138211459| 1|
# |2147481832| 852202566| 1|
# |2147481832| 201375938| 1|
# |2147481832| 486538879| 1|
# |2147481832| 919187908| 1|
# | 214748183| 919187908| 1|
# | 214748183| 91187908| 1|
# +----------+----------+---+
As you can see, this DataFrame has 12 elements:
df.count()
# 12
They are distributed as follows:
df.groupBy("x1").count().show()
# +----------+-----+
# | x1|count|
# +----------+-----+
# |2147481832| 10|
# | 214748183| 2|
# +----------+-----+
Now let's sample. First we'll set the seed:
seed = 12
Then find the keys to fraction on, and sample:
fractions = df.select("x1").distinct().withColumn("fraction", lit(0.8)).rdd.collectAsMap()
print(fractions)
# {2147481832: 0.8, 214748183: 0.8}
sampled_df = df.stat.sampleBy("x1", fractions, seed)
sampled_df.show()
# +----------+---------+---+
# | x1| x2| x3|
# +----------+---------+---+
# |2147481832| 23355149| 1|
# |2147481832|973010692| 1|
# |2147481832|541023347| 1|
# |2147481832|852202566| 1|
# |2147481832|201375938| 1|
# |2147481832|486538879| 1|
# |2147481832|919187908| 1|
# | 214748183|919187908| 1|
# | 214748183| 91187908| 1|
# +----------+---------+---+
We can now check the content of our sample:
sampled_df.count()
# 9
sampled_df.groupBy("x1").count().show()
# +----------+-----+
# | x1|count|
# +----------+-----+
# |2147481832| 7|
# | 214748183| 2|
# +----------+-----+