Randomly Sample Pyspark dataframe with column conditions
Problem description
I'm trying to randomly sample a Pyspark dataframe where a column value meets a certain condition. I would like to use the sample
method to randomly select rows based on a column value. Let's say I have the following data frame:
+---+----+------+-------------+------+
| id|code| amt|flag_outliers|result|
+---+----+------+-------------+------+
| 1| a| 10.9| 0| 0.0|
| 2| b| 20.7| 0| 0.0|
| 3| c| 30.4| 0| 1.0|
| 4| d| 40.98| 0| 1.0|
| 5| e| 50.21| 0| 2.0|
| 6| f| 60.7| 0| 2.0|
| 7| g| 70.8| 0| 2.0|
| 8| h| 80.43| 0| 3.0|
| 9| i| 90.12| 0| 3.0|
| 10| j|100.65| 0| 3.0|
+---+----+------+-------------+------+
I would like to sample only 1(or any certain amount) of each of the 0, 1, 2, 3
based on the result
column so I'd end up with this:
+---+----+------+-------------+------+
| id|code| amt|flag_outliers|result|
+---+----+------+-------------+------+
| 1| a| 10.9| 0| 0.0|
| 3| c| 30.4| 0| 1.0|
| 5| e| 50.21| 0| 2.0|
| 8| h| 80.43| 0| 3.0|
+---+----+------+-------------+------+
Is there a good programmatic way to achieve this, i.e take the same number of rows for each of the values given in a certain column? Any help is really appreciated!
You can use sampleBy(), which returns a stratified sample without replacement based on the fraction you specify for each stratum.
>>> from pyspark.sql.functions import col
>>> dataset = spark.range(0, 100).select((col("id") % 3).alias("result"))
>>> sampled = dataset.sampleBy("result", fractions={0: 0.1, 1: 0.2}, seed=0)
>>> sampled.groupBy("result").count().orderBy("result").show()
+------+-----+
|result|count|
+------+-----+
| 0| 5|
| 1| 9|
+------+-----+