Spark dataframe transform multiple rows to column


Problem description

I am new to Spark, and I want to transform the source DataFrame below (loaded from a JSON file):

+--+-----+-----+
|A |count|major|
+--+-----+-----+
| a|    1|   m1|
| a|    1|   m2|
| a|    2|   m3|
| a|    3|   m4|
| b|    4|   m1|
| b|    1|   m2|
| b|    2|   m3|
| c|    3|   m1|
| c|    4|   m3|
| c|    5|   m4|
| d|    6|   m1|
| d|    1|   m2|
| d|    2|   m3|
| d|    3|   m4|
| d|    4|   m5|
| e|    4|   m1|
| e|    5|   m2|
| e|    1|   m3|
| e|    1|   m4|
| e|    1|   m5|
+--+-----+-----+

into the result DataFrame below:

+--+--+--+--+--+--+
|A |m1|m2|m3|m4|m5|
+--+--+--+--+--+--+
| a| 1| 1| 2| 3| 0|
| b| 4| 1| 2| 0| 0|
| c| 3| 0| 4| 5| 0|
| d| 6| 1| 2| 3| 4|
| e| 4| 5| 1| 1| 1|
+--+--+--+--+--+--+

Here are the transformation rules:

  1. The result DataFrame consists of A plus n major columns, where the major column names are given by:

    sorted(src_df.rdd.map(lambda x: x[2]).distinct().collect())
    

  2. The result DataFrame contains m rows, where the values of the A column are given by:

    sorted(src_df.rdd.map(lambda x: x[0]).distinct().collect())
    

  3. The value in each major column of the result DataFrame is the count from the source row with the matching A and major (e.g. the count in row 1 of the source DataFrame goes into the cell where A is a and the column is m1)

  4. The combinations of A and major in the source DataFrame contain no duplicates (treat the two columns as a composite primary key in SQL)
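
To make the four rules concrete, here is a minimal plain-Python sketch of the same reshaping on a handful of tuples. No Spark is involved; the helper name `pivot_rows` and the `(A, count, major)` tuple layout are illustrative assumptions, and missing combinations are filled with 0 as in the expected result above.

```python
def pivot_rows(rows):
    """Pivot (a, count, major) tuples into one row per distinct a."""
    # Rule 1: one output column per distinct major, in sorted order.
    majors = sorted({major for _, _, major in rows})
    # Rule 2: one output row per distinct value of A, in sorted order.
    keys = sorted({a for a, _, _ in rows})
    # Rules 3 and 4: (A, major) is unique, so a plain dict lookup suffices.
    counts = {(a, major): count for a, count, major in rows}
    header = ["A"] + majors
    body = [[a] + [counts.get((a, m), 0) for m in majors] for a in keys]
    return header, body


sample = [("a", 1, "m1"), ("a", 1, "m2"), ("b", 4, "m1"), ("c", 5, "m4")]
header, body = pivot_rows(sample)
print(header)
for row in body:
    print(row)
```

This only works on data that fits in local memory, so it serves as a reference for the expected output rather than a replacement for a distributed solution.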

Solution

Let's start with example data:

df = sqlContext.createDataFrame([
    ("a", 1, "m1"), ("a", 1, "m2"), ("a", 2, "m3"),
    ("a", 3, "m4"), ("b", 4, "m1"), ("b", 1, "m2"),
    ("b", 2, "m3"), ("c", 3, "m1"), ("c", 4, "m3"),
    ("c", 5, "m4"), ("d", 6, "m1"), ("d", 1, "m2"),
    ("d", 2, "m3"), ("d", 3, "m4"), ("d", 4, "m5"),
    ("e", 4, "m1"), ("e", 5, "m2"), ("e", 1, "m3"),
    ("e", 1, "m4"), ("e", 1, "m5")], 
    ("a", "cnt", "major"))

Please note that I've renamed count to cnt. COUNT is a reserved keyword in most SQL dialects, so it is a poor choice for a column name.

There are at least two ways to reshape this data:

  • aggregating over DataFrame

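    # Note: this max is the Spark aggregate and shadows Python's built-in max.
    from pyspark.sql.functions import col, when, max
    
    # Distinct values of "major" become the output column names.
    # .rdd is needed on Spark 2.0+, where DataFrame no longer exposes map.
    majors = sorted(df.select("major")
        .distinct()
        .rdd
        .map(lambda row: row[0])
        .collect())
    
    # One column per major: cnt where the row matches, NULL otherwise.
    cols = [when(col("major") == m, col("cnt")).otherwise(None).alias(m)
        for m in majors]
    # max ignores NULLs, so it picks the single non-NULL value per group.
    maxs = [max(col(m)).alias(m) for m in majors]
    
    # Groups that lack a given major end up NULL; replace those with 0.
    reshaped1 = (df
        .select(col("a"), *cols)
        .groupBy("a")
        .agg(*maxs)
        .na.fill(0))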
    
    reshaped1.show()
    
    ## +---+---+---+---+---+---+
    ## |  a| m1| m2| m3| m4| m5|
    ## +---+---+---+---+---+---+
    ## |  a|  1|  1|  2|  3|  0|
    ## |  b|  4|  1|  2|  0|  0|
    ## |  c|  3|  0|  4|  5|  0|
    ## |  d|  6|  1|  2|  3|  4|
    ## |  e|  4|  5|  1|  1|  1|
    ## +---+---+---+---+---+---+
    

  • groupBy over RDD

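    from pyspark.sql import Row
    
    # Key each record by "a"; .rdd is needed on Spark 2.0+.
    grouped = (df
        .rdd
        .map(lambda row: (row.a, (row.major, row.cnt)))
        .groupByKey())
    
    def make_row(kv):
        key, values = kv
        # (major, cnt) pairs for this key, plus the key itself under "a".
        tmp = dict(list(values) + [("a", key)])
        # majors is the sorted list built in the first approach; absent ones default to 0.
        return Row(**{name: tmp.get(name, 0) for name in ["a"] + majors})
    
    reshaped2 = sqlContext.createDataFrame(grouped.map(make_row))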
    
    reshaped2.show()
    
    ## +---+---+---+---+---+---+
    ## |  a| m1| m2| m3| m4| m5|
    ## +---+---+---+---+---+---+
    ## |  a|  1|  1|  2|  3|  0|
    ## |  e|  4|  5|  1|  1|  1|
    ## |  c|  3|  0|  4|  5|  0|
    ## |  b|  4|  1|  2|  0|  0|
    ## |  d|  6|  1|  2|  3|  4|
    ## +---+---+---+---+---+---+
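
On Spark 1.6 and later, `GroupedData.pivot` can express the same reshaping directly. The following is a minimal sketch, not part of the original answer: it assumes `df` has the columns "a", "cnt" and "major" and that `majors` is the sorted list of distinct major values built earlier. The function name `reshape_with_pivot` is hypothetical.

```python
def reshape_with_pivot(df, majors):
    """Pivot the "major" values of df into columns, one row per "a".

    Assumes df is a Spark DataFrame with columns "a", "cnt" and "major",
    and majors is the list of distinct "major" values to use as columns.
    """
    return (df
        .groupBy("a")
        # Passing the values explicitly spares Spark a pass to discover them.
        .pivot("major", majors)
        # (a, major) pairs are unique, so sum just carries cnt through.
        .sum("cnt")
        # Combinations absent from the source come back as NULL.
        .na.fill(0))
```

Calling `reshape_with_pivot(df, majors).show()` should print the same table as the two approaches above. Row order after a `groupBy` is not guaranteed, so append `.orderBy("a")` if a stable order matters.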
    
