Group by column "grp" and compress DataFrame - (take last not null value for each column ordering by column "ord")


Problem description

Assuming I have the following DataFrame:

+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
|  1|    null|  3|null|  11|
|  2|    null|  2| xxx|  22|
|  1|    null|  1| yyy|null|
|  2|    null|  7|null|  33|
|  1|    null| 12|null|null|
|  2|    null| 19|null|  77|
|  1|    null| 10| s13|null|
|  2|    null| 11| a23|null|
+---+--------+---+----+----+

Here is the same sample DF with comments, sorted by grp and ord:

scala> df.orderBy("grp", "ord").show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
|  1|    null|  1| yyy|null|
|  1|    null|  3|null|  11|   # grp:1 - last value for `col2` (11)
|  1|    null| 10| s13|null|   # grp:1 - last value for `col1` (s13)
|  1|    null| 12|null|null|   # grp:1 - last values for `null_col`, `ord`
|  2|    null|  2| xxx|  22|   
|  2|    null|  7|null|  33|   
|  2|    null| 11| a23|null|   # grp:2 - last value for `col1` (a23)
|  2|    null| 19|null|  77|   # grp:2 - last values for `null_col`, `ord`, `col2`
+---+--------+---+----+----+

I would like to compress it, i.e. group it by column "grp" and, for each group, order the rows by the "ord" column and take the last non-null value in each column (if there is one).

+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
|  1|    null| 12| s13|  11|
|  2|    null| 19| a23|  77|
+---+--------+---+----+----+

I have seen the following similar questions:

  • How to select the first row of each group?
  • How to find first non-null values in groups? (secondary sorting using dataset api)

but my real DataFrame has over 250 columns, so I need a solution where I don't have to specify all the columns explicitly.

I can't wrap my head around it…

MCVE: how to create the sample DataFrame:


  1. create a local file "/tmp/data.txt" and copy-paste the content of the DataFrame there (as posted above)
  2. define function readSparkOutput():
  3. parse "/tmp/data.txt" to DataFrame:

val df = readSparkOutput("file:///tmp/data.txt")


UPDATE: I think it should be similar to the following SQL:

SELECT
  grp, ord, null_col, col1, col2
FROM (
    SELECT
      grp,
      ord,
      FIRST(null_col) OVER (PARTITION BY grp ORDER BY ord DESC) as null_col,
      FIRST(col1) OVER (PARTITION BY grp ORDER BY ord DESC) as col1,
      FIRST(col2) OVER (PARTITION BY grp ORDER BY ord DESC) as col2,
      ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ord DESC) as rn
    FROM table_name) as v
WHERE v.rn = 1;

How can we dynamically generate such a Spark query?
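
Mechanically, such a query string can be assembled from df.columns, for example as in the sketch below. It only shows the string building: it reuses the table_name view name from the SQL above and keeps the FIRST(...) window spec exactly as written, without judging whether that spec really returns the last non-null values.

df.createOrReplaceTempView("table_name")

// every column except the grouping key gets a FIRST(...) window expression
val valueCols = df.columns.filterNot(_ == "grp")
val windowExprs = valueCols
  .map(c => s"FIRST($c) OVER (PARTITION BY grp ORDER BY ord DESC) AS $c")
  .mkString(",\n      ")

val query = s"""
  SELECT ${df.columns.mkString(", ")}
  FROM (
    SELECT
      grp,
      $windowExprs,
      ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ord DESC) AS rn
    FROM table_name) AS v
  WHERE v.rn = 1"""

spark.sql(query).show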

I tried the following simplified approach:

import org.apache.spark.sql.expressions.Window

val win = Window
  .partitionBy("grp")
  .orderBy($"ord".desc)

val cols = df.columns.map(c => first(c, ignoreNulls=true).over(win).as(c))

which produces:

scala> cols
res23: Array[org.apache.spark.sql.Column] = Array(first(grp, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `grp`, first(null_col, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `null_col`, first(ord, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `ord`, first(col1, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col1`, first(col2, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col2`)

but I can't pass it to df.select:

scala> df.select(cols.head, cols.tail: _*).show
<console>:34: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
       df.select(cols.head, cols.tail: _*).show

Another attempt:

scala> df.select(cols.map(col): _*).show
<console>:34: error: type mismatch;
 found   : String => org.apache.spark.sql.Column
 required: org.apache.spark.sql.Column => ?
       df.select(cols.map(col): _*).show
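
A side note on the two compile errors: cols is already an Array[Column] (df.columns gives an Array[String] and the map turns each name into a Column), and Scala does not allow mixing a plain argument with a : _* expansion for the same vararg parameter; the second attempt fails because map(col) expects strings. The whole array can instead be passed in one go, e.g. (this only fixes the syntax, producing one row per group is what the accepted answer below addresses):

scala> df.select(cols: _*).show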


Recommended answer

I'd go with the same approach as @LeoC, but I don't believe there is any need to manipulate the column names as strings, so I would go with a more spark-sql-like answer.

import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, last, lit, when}

val win = Window.partitionBy("grp").orderBy(col("ord")).rowsBetween(0, Window.unboundedFollowing)

// In case there is more than one group column
val nonAggCols = Seq("grp")

// Select columns to aggregate on
val cols: Seq[String] = df.columns.diff(nonAggCols).toSeq

// Map over selection and apply fct
val aggregations: Seq[Column] = cols.map(c => first(col(c), ignoreNulls = true).as(c))

// I'd rather cache the following step as it might get expensive
val step1 = cols.foldLeft(df)((acc, c) => acc.withColumn(c, last(col(c), ignoreNulls = true).over(win))).cache

// Finally we can aggregate our results as followed
val results = step1.groupBy(nonAggCols.head, nonAggCols.tail: _*).agg(aggregations.head, aggregations.tail: _*)

results.show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// |  1|    null| 12| s13|  11|
// |  2|    null| 19| a23|  77|
// +---+--------+---+----+----+
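
In other words: the window frame runs from the current row to the end of its partition, so after the foldLeft step every row of a group holds, in each column, either null or that group's last non-null value; the final groupBy with first(..., ignoreNulls = true) then deterministically collapses each group into a single row.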

I hope this helps.

Edit: The reason you are not getting the same results is that the reader you are using isn't correct.

It interprets null from the file as a string and not as a real null; i.e.:

scala> df.filter('col1.isNotNull).show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// |  1|    null|  3|null|  11|
// |  2|    null|  2| xxx|  22|
// |  1|    null|  1| yyy|null|
// |  2|    null|  7|null|  33|
// |  1|    null| 12|null|null|
// |  2|    null| 19|null|  77|
// |  1|    null| 10| s13|null|
// |  2|    null| 11| a23|null|
// +---+--------+---+----+----+

Here is my version of readSparkOutput:

def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
  val step1 = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .option("parserLib", "UNIVOCITY")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("comment", "+")
    .csv(filePath)

  // drop the artificial border columns ("_c0", "_cN") created by the leading and trailing "|"
  val step2 = step1.select(step1.columns.filterNot(_.startsWith("_c")).map(step1(_)): _*)

  val columns = step2.columns
  // turn the literal string "null" into a real null; other values (and non-string columns) are kept as-is
  columns.foldLeft(step2)((acc, c) => acc.withColumn(c, when(col(c) === "null", lit(null)).otherwise(col(c))))
}
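
A quick usage sketch (assuming the table dump from the question was saved to /tmp/data.txt):

val df = readSparkOutput("file:///tmp/data.txt")
df.printSchema()                      // check the inferred column types
df.filter('col1.isNotNull).show()     // now returns only the rows that really have a value in col1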
