Select the last element of an Array in a DataFrame
Question
I'm working on a project that involves nested JSON data with a complicated schema. Basically, I want to transform one of the columns in a DataFrame so that it holds only the last element of the array it contains. I'm totally stuck on how to do this. I hope this makes sense.
Below is an example of what I'm trying to accomplish:
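import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

val singersDF = Seq(
  ("beatles", "help,hey,jude"),
  ("romeo", "eres,mia"),
  ("elvis", "this,is,an,example")
).toDF("name", "hit_songs")

// Split the comma-separated string into an array column
val actualDF = singersDF.withColumn(
  "hit_songs",
  split(col("hit_songs"), "\\,")
)

actualDF.show(false)
actualDF.printSchema()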
+-------+-----------------------+
|name   |hit_songs              |
+-------+-----------------------+
|beatles|[help, hey, jude]      |
|romeo  |[eres, mia]            |
|elvis  |[this, is, an, example]|
+-------+-----------------------+
root
 |-- name: string (nullable = true)
 |-- hit_songs: array (nullable = true)
 |    |-- element: string (containsNull = true)
The end goal is the following output, which selects the last string in the hit_songs array.
I'm not worried about what the schema would look like afterwards.
+-------+---------+
|name   |hit_songs|
+-------+---------+
|beatles|jude     |
|romeo  |mia      |
|elvis  |example  |
+-------+---------+
Recommended answer
You can use the size function to calculate the zero-based index of the desired item in the array (size minus one), and then pass it as the argument to Column.apply (explicitly or implicitly):
import org.apache.spark.sql.functions._
import spark.implicits._
actualDF.withColumn("hit_songs", $"hit_songs".apply(size($"hit_songs").minus(1)))
Or:
actualDF.withColumn("hit_songs", $"hit_songs"(size($"hit_songs").minus(1)))
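For readers who want to try this end to end, here is a minimal sketch that wraps the size-based approach from the answer in a small helper and runs it on the question's sample data. The object name LastElementExample, the helper names bySize and byElementAt, and the local SparkSession settings are assumptions made for illustration and are not part of the original answer. The second helper uses element_at, a different technique from the one in the answer: it assumes Spark 2.4 or later, where element_at takes a 1-based index and a negative index counts from the end of the array.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, element_at, size, split}

object LastElementExample {

  // Approach from the answer: zero-based index computed as size - 1,
  // passed to Column.apply via the implicit call syntax.
  def bySize(df: DataFrame): DataFrame =
    df.withColumn("hit_songs", col("hit_songs")(size(col("hit_songs")) - 1))

  // Alternative (assumes Spark 2.4+): element_at is 1-based,
  // and -1 refers to the last element of the array.
  def byElementAt(df: DataFrame): DataFrame =
    df.withColumn("hit_songs", element_at(col("hit_songs"), -1))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("last-element-example")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question.
    val actualDF = Seq(
      ("beatles", "help,hey,jude"),
      ("romeo", "eres,mia"),
      ("elvis", "this,is,an,example")
    ).toDF("name", "hit_songs")
      .withColumn("hit_songs", split(col("hit_songs"), ","))

    bySize(actualDF).show(false)
    byElementAt(actualDF).show(false)

    spark.stop()
  }
}
```

On the sample data, both helpers should give the table shown in the question (jude, mia, example). The size-based form works on older Spark versions, while element_at avoids computing the index by hand. How either form treats empty or null arrays depends on the Spark version and configuration, so it is worth checking against your own data.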