How do I reduce a spark dataframe to a maximum amount of rows for each value in a column?

Problem Description

I need to reduce a dataframe and export it to a parquet file. I need to make sure that I have, for example, at most 10,000 rows for each value in a column.

The dataframe I am working with looks like the following:

+-------------+-------------------+
|         Make|              Model|
+-------------+-------------------+
|      PONTIAC|           GRAND AM|
|        BUICK|            CENTURY|
|        LEXUS|             IS 300|
|MERCEDES-BENZ|           SL-CLASS|
|      PONTIAC|           GRAND AM|
|       TOYOTA|              PRIUS|
|   MITSUBISHI|      MONTERO SPORT|
|MERCEDES-BENZ|          SLK-CLASS|
|       TOYOTA|              CAMRY|
|         JEEP|           WRANGLER|
|    CHEVROLET|     SILVERADO 1500|
|       TOYOTA|             AVALON|
|         FORD|             RANGER|
|MERCEDES-BENZ|            C-CLASS|
|       TOYOTA|             TUNDRA|
|         FORD|EXPLORER SPORT TRAC|
|    CHEVROLET|           COLORADO|
|   MITSUBISHI|            MONTERO|
|        DODGE|      GRAND CARAVAN|
+-------------+-------------------+

I need to return at most 10,000 rows for each model:

+--------------------+-------+
|               Model|  count|
+--------------------+-------+
|                 MDX|1658647|
|               ASTRO| 682657|
|           ENTOURAGE|  72622|
|             ES 300H|  80712|
|            6 SERIES| 145252|
|           GRAN FURY|   9719|
|RANGE ROVER EVOQU...|   4290|
|        LEGACY WAGON|   2070|
|        LEGACY SEDAN|    104|
|  DAKOTA CHASSIS CAB|      8|
|              CAMARO|2028678|
|                  XT|  10009|
|             DYNASTY| 171776|
|                 944|  43044|
|         F430 SPIDER|    506|
|FLEETWOOD SEVENTY...|      6|
|         MONTE CARLO|1040806|
|             LIBERTY|2415456|
|            ESCALADE| 798832|
| SIERRA 3500 CLASSIC|   9541|
+--------------------+-------+
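
For reference, a per-model row count like the table above can be produced with a simple aggregation. This is a minimal sketch, assuming the dataframe is named df:

from pyspark.sql import functions as F

# Count rows per Model and sort descending to see which models exceed 10,000 rows.
df.groupBy('Model').count().orderBy(F.desc('count')).show()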

This question is not the same because, as others have suggested below, it only retrieves rows where a value is greater than other values. What I want is: for each value in df['Model'], limit the rows for that value (model) to 10,000 if there are 10,000 or more rows (pseudo-code, obviously). In other words, if a model has more than 10,000 rows, drop the rest; otherwise keep all of its rows.

Recommended Answer

I guess you should use row_number over a window (partitionBy the column and orderBy some expression) to rank the rows, and then you can filter with your limit. For example, taking a random shuffle and limiting the sample to 10,000 rows per value can be done as follows:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each Model in random order, so the rows kept per model are a random sample.
window = Window.partitionBy(df['Model']).orderBy(F.rand())

# Add the rank as 'row_number' and keep at most 10,000 rows per Model.
df = df.select(F.col('*'),
               F.row_number().over(window).alias('row_number')) \
       .where(F.col('row_number') <= 10000)
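
If the goal from the question is to write the reduced data out, the helper column can be dropped before exporting. This is a minimal sketch, assuming the filtered df from above and a hypothetical output path:

# Remove the ranking column added above; it is only needed for the filter.
df = df.drop('row_number')

# Write the capped dataframe to parquet (the output path here is hypothetical).
df.write.mode('overwrite').parquet('/tmp/models_max_10000.parquet')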
