Select the last element of an Array in a DataFrame
Question
I'm working on a project that involves nested JSON data with a complicated schema. Basically, I want to transform one of the columns in a DataFrame so that it holds only the last element of its array. I'm totally stuck on how to do this. I hope this makes sense.
Below is an example of what I'm trying to accomplish:
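// Assumes an active SparkSession named `spark` (as in spark-shell)
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

val singersDF = Seq(
  ("beatles", "help,hey,jude"),
  ("romeo", "eres,mia"),
  ("elvis", "this,is,an,example")
).toDF("name", "hit_songs")

// Split the comma-separated string into an array column
val actualDF = singersDF.withColumn(
  "hit_songs",
  split(col("hit_songs"), "\\,")
)

actualDF.show(false)
actualDF.printSchema()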
+-------+-----------------------+
|name   |hit_songs              |
+-------+-----------------------+
|beatles|[help, hey, jude]      |
|romeo  |[eres, mia]            |
|elvis  |[this, is, an, example]|
+-------+-----------------------+

root
 |-- name: string (nullable = true)
 |-- hit_songs: array (nullable = true)
 |    |-- element: string (containsNull = true)
The end goal is the output below, which selects the last string in the hit_songs array.
I'm not worried about what the schema would look like afterwards.
+-------+---------+
|name   |hit_songs|
+-------+---------+
|beatles|jude     |
|romeo  |mia      |
|elvis  |example  |
+-------+---------+
Recommended answer
You can use the size function to calculate the zero-based index of the desired item in the array (size minus 1), and then pass this as the argument of Column.apply (explicitly or implicitly):
import org.apache.spark.sql.functions._
import spark.implicits._
actualDF.withColumn("hit_songs", $"hit_songs".apply(size($"hit_songs").minus(1)))
Or:
actualDF.withColumn("hit_songs", $"hit_songs"(size($"hit_songs").minus(1)))
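For anyone who wants to try this outside spark-shell, here is a minimal self-contained sketch of the approach above. The object name LastElementExample, the helper names withLastHit and withLastHitElementAt, the local master setting, and the app name are all hypothetical choices made for illustration. The second helper is not part of the answer: it shows an alternative that assumes Spark 2.4 or later, where element_at accepts a negative, 1-based index counted from the end of the array. Behaviour for empty or null arrays is not covered here and may depend on your Spark version and settings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, element_at, size, split}

object LastElementExample {

  // Approach from the answer: pass the zero-based index (size - 1) to Column.apply.
  def withLastHit(df: DataFrame): DataFrame =
    df.withColumn("hit_songs", col("hit_songs")(size(col("hit_songs")) - 1))

  // Alternative, assuming Spark 2.4+: element_at is 1-based and -1 refers to the last element.
  def withLastHitElementAt(df: DataFrame): DataFrame =
    df.withColumn("hit_songs", element_at(col("hit_songs"), -1))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("last-element-example")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val singersDF = Seq(
      ("beatles", "help,hey,jude"),
      ("romeo", "eres,mia"),
      ("elvis", "this,is,an,example")
    ).toDF("name", "hit_songs")

    // Split the comma-separated string into an array column.
    val actualDF = singersDF.withColumn("hit_songs", split(col("hit_songs"), ","))

    withLastHit(actualDF).show(false)
    withLastHitElementAt(actualDF).show(false)

    spark.stop()
  }
}
```

Both helpers replace the hit_songs array column with a plain string column, which matches the output requested in the question.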