How do I reduce a Spark dataframe to a maximum number of rows for each value in a column?
Problem description
I need to reduce a dataframe and export it to Parquet. I need to make sure that I have, for example, at most 10,000 rows for each value in a given column.
The dataframe I am working with looks like the following:
+-------------+-------------------+
| Make| Model|
+-------------+-------------------+
| PONTIAC| GRAND AM|
| BUICK| CENTURY|
| LEXUS| IS 300|
|MERCEDES-BENZ| SL-CLASS|
| PONTIAC| GRAND AM|
| TOYOTA| PRIUS|
| MITSUBISHI| MONTERO SPORT|
|MERCEDES-BENZ| SLK-CLASS|
| TOYOTA| CAMRY|
| JEEP| WRANGLER|
| CHEVROLET| SILVERADO 1500|
| TOYOTA| AVALON|
| FORD| RANGER|
|MERCEDES-BENZ| C-CLASS|
| TOYOTA| TUNDRA|
| FORD|EXPLORER SPORT TRAC|
| CHEVROLET| COLORADO|
| MITSUBISHI| MONTERO|
| DODGE| GRAND CARAVAN|
+-------------+-------------------+
I need to return at most 10,000 rows for each model:
+--------------------+-------+
| Model| count|
+--------------------+-------+
| MDX|1658647|
| ASTRO| 682657|
| ENTOURAGE| 72622|
| ES 300H| 80712|
| 6 SERIES| 145252|
| GRAN FURY| 9719|
|RANGE ROVER EVOQU...| 4290|
| LEGACY WAGON| 2070|
| LEGACY SEDAN| 104|
| DAKOTA CHASSIS CAB| 8|
| CAMARO|2028678|
| XT| 10009|
| DYNASTY| 171776|
| 944| 43044|
| F430 SPIDER| 506|
|FLEETWOOD SEVENTY...| 6|
| MONTE CARLO|1040806|
| LIBERTY|2415456|
| ESCALADE| 798832|
| SIERRA 3500 CLASSIC| 9541|
+--------------------+-------+
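As an aside, a per-model count table like the one above can be produced with a simple groupBy (a minimal sketch, assuming df is the Make/Model dataframe shown earlier):

# Count how many rows exist for each model; df is the dataframe shown above
df.groupBy('Model').count().show()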
This question is not the same as the suggested duplicate because that approach, as others have noted below, only retrieves rows where a value is greater than other values. I want: for each value in df['Model'], if there are 10,000 or more rows for that value (model), limit them to 10,000 (pseudo-code, obviously). In other words, if a model has more than 10,000 rows, discard the rest; otherwise keep all of its rows.
Recommended answer
I think you should use row_number over a window, partitioned by the column (partitionBy) and ordered (orderBy), to number the rows, and then filter on that number with your limit. For example, to take a random shuffle and cap the sample at 10,000 rows per value:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows of each Model partition in a random order
window = Window.partitionBy(df['Model']).orderBy(F.rand())

# Keep at most 10,000 randomly chosen rows per model
df = (df.select(F.col('*'),
                F.row_number().over(window).alias('row_number'))
        .where(F.col('row_number') <= 10000))
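Once the cap is applied, the helper column can be dropped before the Parquet export mentioned in the question (a sketch; the output path is hypothetical):

# Remove the helper column and write the capped dataframe to Parquet
df = df.drop('row_number')
df.write.mode('overwrite').parquet('/path/to/output.parquet')  # hypothetical path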