如何在when里面做窗口函数? [英] How to do window function inside a when?

查看:17
本文介绍了如何在when里面做窗口函数?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据框,我试图在其中对数组列执行窗口函数.

I have a dataframe where I am trying to do window function on a array column.

逻辑如下:按(或窗口分区)idfiltered列分组.计算types列为空的行的最大分数,否则取该行的分数.当分数不等于组的最大分数时,在列类型中添加NA".

The logic is as follows: Group by (or window partition) the id and filtered columns. Calculate the max score of the rows where the types column is null, otherwise take the score of that row. When score is not equal to the max score of the group add "NA" to the column type.

val data = spark.createDataFrame(Seq(
  (1, "shirt for women", Seq("shirt", "women"), 19.1, "ST"),
  (1, "shirt for women", Seq("shirt", "women"), 10.1, null),
  (1, "shirt for women", Seq("shirt", "women"), 12.1, null),
  (0, "shirt group women", Seq("group", "women"), 15.1, null),
  (0, "shirt group women", Seq("group", "women"), 12.1, null),
  (3, "shirt nmn women", Seq("shirt", "women"), 16.1, "ST"),
  (3, "shirt were women", Seq("shirt", "women"), 13.1, "ST")
)).toDF("id", "raw", "filtered", "score", "types")

  +---+-----------------+--------------+-----+-----+
|id |raw              |filtered      |score|types|
+---+-----------------+--------------+-----+-----+
|1  |shirt for women  |[shirt, women]|19.1 |ST   |
|1  |shirt for women  |[shirt, women]|10.1 |null |
|1  |shirt for women  |[shirt, women]|12.1 |null |
|0  |shirt group women|[group, women]|15.1 |null |
|0  |shirt group women|[group, women]|12.1 |null |
|3  |shirt nmn women  |[shirt, women]|16.1 |ST   |
|3  |shirt were women |[shirt, women]|13.1 |ST   |
+---+-----------------+--------------+-----+-----+

预期输出:

   +---+------------------+--------------+-----+----+
    |id |raw              |filtered      |score|types|
    +---+-----------------+--------------+-----+----+
    |1  |shirt for women  |[shirt, women]|19.1 |ST  |
    |1  |shirt for women  |[shirt, women]|10.1 |NA  |
    |1  |shirt for women  |[shirt, women]|12.1 |null|
    |0  |shirt group women[women, group] |15.1 |null|
    |0  |shirt group women|[women, group]|12.1 |NA  |
    |3  |shirt nmn women  |[shirt, women]|16.1 |ST  |
    |3  |shirt were women |[shirt, women]|13.1 |ST  |
    +---+-----------------+--------------+-----+----+

我试过了:

data.withColumn("max_score",
      when(col("types").isNull,
        max("score")
          .over(Window.partitionBy("id", "filtered")))
        .otherwise($"score"))
      .withColumn("type_temp",
        when(col("score") =!= col("max_score"),
          addReasonsUDF(col("type"),
            lit("NA")))
          .otherwise(col("type")))
      .drop("types", "max_score")
      .withColumnRenamed("type_temp", "types")

但是它不起作用.这给了我:

But it is not working. This gives me:

+---+-----------------+--------------+-----+---------+-----+
|id |raw              |filtered      |score|max_score|types|
+---+-----------------+--------------+-----+---------+-----+
|1  |shirt for women  |[shirt, women]|19.1 |19.1     |ST   |
|1  |shirt women      |[shirt, women]|10.1 |19.1     |NA   |
|1  |shirt of women   |[shirt, women]|12.1 |19.1     |NA   |
|0  |shirt group women|[group, women]|15.1 |15.1     |null |
|0  |shirt will women |[group, women]|12.1 |15.1     |NA   |
|3  |shirt nmn women  |[shirt, women]|16.1 |16.1     |ST   |
|3  |shirt were women |[shirt, women]|13.1 |13.1     |ST   |
+---+-----------------+--------------+-----+---------+-----+

有人能告诉我我在这里做错了什么吗?

Can some one tell me what I am doing wrong here ?

我认为我的窗口函数有问题,当我尝试对 idraw 进行分区时,它也无法正常工作.所以字符串和数组分区都不起作用.

I think something wrong with my window function, when I tried partition against id and raw its not working as well. So both string and array partitions are not working.

dataSet.withColumn("max_score",
      when(col("types").isNull,
        max("score").over(Window.partitionBy("id", "raw")))
        .otherwise($"score")).show(false)

+---+-----------------+--------------+-----+-----+---------+
|id |raw              |filtered      |score|types|max_score|
+---+-----------------+--------------+-----+-----+---------+
|3  |shirt nmn women  |[shirt, women]|16.1 |ST   |16.1     |
|0  |shirt group women|[group, women]|15.1 |null |15.1     |
|0  |shirt group women|[group, women]|12.1 |null |15.1     |
|3  |shirt were women |[shirt, women]|13.1 |ST   |13.1     |
|1  |shirt for women  |[shirt, women]|19.1 |ST   |19.1     |
|1  |shirt for women  |[shirt, women]|10.1 |null |19.1     |
|1  |shirt for women  |[shirt, women]|12.1 |null |19.1     |
+---+-----------------+--------------+-----+-----+---------+

推荐答案

您不需要在 when 表达式中包含窗口函数,而是可以分两个阶段完成.首先根据idfiltered types将最大分数作为新列添加到每个组中.这将专门为 typesnull 的组提供最大分数.由于应保留其他列,因此首选窗口表达式.

You do not need to have a window function inside a when expression, instead this can be done in two stages. First add the max score as a new column to each group based on the id, filtered and the types columns. This will give a max score specifically for the groups where types are null. A window expression is preferred for this since the other columns should be kept.

此后,可以通过when/otherwise 进行检查以更改types 列的值,只要types 有一个 null 值并且最大分数不等于 score.

After this, a check with when/otherwise can be done to change the value of the types column as long as types have a null value and the max score is not equal to score.

在代码中:

val w = Window.partitionBy("id", "filtered", "types")
val df = data.withColumn("max_score", max($"score").over(w))
  .withColumn("types", when($"types".isNull && $"score" =!= $"max_score", "NA").otherwise($"types"))
  .drop("max_score")

这篇关于如何在when里面做窗口函数?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆