Randomly Sample Pyspark dataframe with column conditions


Question


I'm trying to randomly sample a Pyspark dataframe where a column value meets a certain condition. I would like to use the sample method to randomly select rows based on a column value. Let's say I have the following data frame:

+---+----+------+-------------+------+
| id|code|   amt|flag_outliers|result|
+---+----+------+-------------+------+
|  1|   a|  10.9|            0|   0.0|
|  2|   b|  20.7|            0|   0.0|
|  3|   c|  30.4|            0|   1.0|
|  4|   d| 40.98|            0|   1.0|
|  5|   e| 50.21|            0|   2.0|
|  6|   f|  60.7|            0|   2.0|
|  7|   g|  70.8|            0|   2.0|
|  8|   h| 80.43|            0|   3.0|
|  9|   i| 90.12|            0|   3.0|
| 10|   j|100.65|            0|   3.0|
+---+----+------+-------------+------+

I would like to sample only 1 (or any given number) of rows for each of the values 0, 1, 2, 3 in the result column, so I'd end up with this:

+---+----+------+-------------+------+
| id|code|   amt|flag_outliers|result|
+---+----+------+-------------+------+
|  1|   a|  10.9|            0|   0.0|
|  3|   c|  30.4|            0|   1.0|
|  5|   e| 50.21|            0|   2.0|
|  8|   h| 80.43|            0|   3.0|
+---+----+------+-------------+------+

Is there a good programmatic way to achieve this, i.e. take the same number of rows for each of the values in a given column? Any help is really appreciated!

Solution

You can use sampleBy(), which returns a stratified sample without replacement, based on the fraction given for each stratum.

>>> from pyspark.sql.functions import col
>>> dataset = sqlContext.range(0, 100).select((col("id") % 3).alias("result"))
>>> sampled = dataset.sampleBy("result", fractions={0: 0.1, 1: 0.2}, seed=0)
>>> sampled.groupBy("result").count().orderBy("result").show()

+------+-----+
|result|count|
+------+-----+
|     0|    5|
|     1|    9|
+------+-----+
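
Note that sampleBy() draws each row independently with the given probability, so the per-stratum counts are only approximately fraction × stratum size; they will vary from run to run. If you need exactly n rows per result value, as the question asks, here is a minimal sketch of two common approaches (assuming the question's dataframe is bound to a variable df; neither is from the original answer):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

n = 1  # desired rows per distinct `result` value

# Option 1 (approximate): build a fractions dict so sampleBy()
# draws roughly n rows from each stratum.
counts = {row["result"]: row["count"]
          for row in df.groupBy("result").count().collect()}
fractions = {k: min(1.0, n / v) for k, v in counts.items()}
approx_sample = df.sampleBy("result", fractions=fractions, seed=0)

# Option 2 (exact): rank the rows of each stratum in random order
# and keep the first n. This guarantees exactly n rows per value
# (or all rows, if a stratum has fewer than n).
w = Window.partitionBy("result").orderBy(F.rand(seed=0))
exact_sample = (df.withColumn("rn", F.row_number().over(w))
                  .filter(F.col("rn") <= n)
                  .drop("rn"))

Option 2 involves a shuffle to partition by result, but it is the simpler choice when the per-group count must be exact rather than probabilistic.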
