Select the last element of an Array in a DataFrame


Question

I'm working on a project that involves nested JSON data with a complicated schema/data structure. Basically, I want to transform one of the columns in a DataFrame so that it holds only the last element of its array. I'm totally stuck on how to do this. I hope this makes sense.

Below is an example of what I'm trying to accomplish:

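import org.apache.spark.sql.functions.{col, split}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val singersDF = Seq(
  ("beatles", "help,hey,jude"),
  ("romeo", "eres,mia"),
  ("elvis", "this,is,an,example")
).toDF("name", "hit_songs")

// Split the comma-separated string into an array of strings
val actualDF = singersDF.withColumn(
  "hit_songs",
  split(col("hit_songs"), "\\,")
)

actualDF.show(false)
actualDF.printSchema()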

+-------+-----------------------+
|name   |hit_songs              |
+-------+-----------------------+
|beatles|[help, hey, jude]      |
|romeo  |[eres, mia]            |
|elvis  |[this, is, an, example]|
+-------+-----------------------+
root
 |-- name: string (nullable = true)
 |-- hit_songs: array (nullable = true)
 |    |-- element: string (containsNull = true)

The end goal is the output below: select the last string in the hit_songs array.

I'm not worried about what the schema looks like afterwards.

+-------+---------+
|name   |hit_songs|
+-------+---------+
|beatles|jude     |
|romeo  |mia      |
|elvis  |example  |
+-------+---------+

Recommended answer

You can use the size function to compute the zero-based index of the last item in the array (size minus 1), and then pass that index as the argument of Column.apply (explicitly or implicitly):

import org.apache.spark.sql.functions._
import spark.implicits._

actualDF.withColumn("hit_songs", $"hit_songs".apply(size($"hit_songs").minus(1)))

Or:

actualDF.withColumn("hit_songs", $"hit_songs"(size($"hit_songs").minus(1)))
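
For reference, here is a minimal, self-contained sketch that runs the size-based lookup from the answer next to element_at, a built-in function available since Spark 2.4 that accepts a negative index to count from the end of the array. The object name LastElementSketch, its two helper methods, and the local SparkSession settings are assumptions made for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, element_at, size, split}

// Hypothetical object and helper names, used only for this illustration.
object LastElementSketch {

  // Approach from the answer: zero-based index computed as size - 1.
  def lastBySize(df: DataFrame, c: String): DataFrame =
    df.withColumn(c, col(c)(size(col(c)) - 1))

  // Alternative (Spark 2.4+): element_at is one-based, and -1 means the last element.
  def lastByElementAt(df: DataFrame, c: String): DataFrame =
    df.withColumn(c, element_at(col(c), -1))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("last-element-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question, split into an array column.
    val actualDF = Seq(
      ("beatles", "help,hey,jude"),
      ("romeo", "eres,mia"),
      ("elvis", "this,is,an,example")
    ).toDF("name", "hit_songs")
      .withColumn("hit_songs", split(col("hit_songs"), ","))

    lastBySize(actualDF, "hit_songs").show(false)
    lastByElementAt(actualDF, "hit_songs").show(false)

    spark.stop()
  }
}
```

Both helpers should produce the table shown in the question. The element_at variant avoids computing the index by hand, but it requires Spark 2.4 or later.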
