Pyspark - Aggregation on multiple columns


Problem description

I have data like below. Filename: babynames.csv.

year    name    percent     sex
1880    John    0.081541    boy
1880    William 0.080511    boy
1880    James   0.050057    boy

I need to sort the input based on year and sex, and I want the output aggregated like below (this output is to be assigned to a new RDD).

year    sex   avg(percentage)   count(rows)
1880    boy   0.070703         3
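
As a quick sanity check on the numbers in this expected output, the average can be reproduced with plain Python arithmetic over the three sample rows:

```python
# Average of the three sample "percent" values for the (1880, boy) group
percents = [0.081541, 0.080511, 0.050057]
avg_percent = sum(percents) / len(percents)

print(round(avg_percent, 6), len(percents))  # 0.070703 3
```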

I am not sure how to proceed after the following step in pyspark. I need your help on this.

testrdd = sc.textFile("babynames.csv")
rows = testrdd.map(lambda y:y.split(',')).filter(lambda x:"year" not in x[0])
aggregatedoutput = ????

Answer

  1. Follow the instructions from the README to include the spark-csv package
  2. Load the data

df = (sqlContext.read
    .format("com.databricks.spark.csv")
    .options(inferSchema="true", delimiter=",", header="true")  # delimiter "," matches the comma-separated file
    .load("babynames.csv"))

  • Import required functions

    from pyspark.sql.functions import count, avg
    

  • Group by and aggregate (optionally using Column.alias):

    df.groupBy("year", "sex").agg(avg("percent"), count("*"))
    

  • Or:

    • cast percent to numeric
    • reshape to a format of ((year, sex), percent)
    • aggregateByKey using pyspark.statcounter.StatCounter
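
    A minimal sketch of that alternative, carrying (sum, count) pairs instead of a full StatCounter so the combine logic stays visible. The seq/comb functions below are what would be passed to rdd.aggregateByKey; the loop is a plain-Python stand-in for the distributed run, using the three sample rows:

```python
# Sketch of the aggregateByKey combine logic with (sum, count) pairs
# (a simplified stand-in for pyspark.statcounter.StatCounter).

zero = (0.0, 0)  # (running sum, running count)

def seq_op(acc, percent):
    # Fold one percent value into a partition's accumulator
    return (acc[0] + percent, acc[1] + 1)

def comb_op(a, b):
    # Merge two partition accumulators
    return (a[0] + b[0], a[1] + b[1])

# Rows reshaped to ((year, sex), percent), as in the steps above
pairs = [
    (("1880", "boy"), 0.081541),
    (("1880", "boy"), 0.080511),
    (("1880", "boy"), 0.050057),
]

# Plain-Python stand-in for rdd.aggregateByKey(zero, seq_op, comb_op)
acc_by_key = {}
for key, percent in pairs:
    acc_by_key[key] = seq_op(acc_by_key.get(key, zero), percent)
acc_by_key = dict(  # comb_op would merge per-partition results here
    (k, v) for k, v in acc_by_key.items()
)

# Final (avg, count) per key; avg ≈ 0.070703, count 3 for (1880, boy)
result = {k: (s / n, n) for k, (s, n) in acc_by_key.items()}
```

    On the real RDD, the same functions would be applied as rows.map(lambda r: ((r[0], r[3]), float(r[2]))).aggregateByKey(zero, seq_op, comb_op).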
