Select the last element of an Array in a DataFrame

Question

I'm working on a project that involves nested JSON data with a complicated schema. What I want to do is transform one of the columns in a DataFrame so that it keeps only the last element of the array it contains. I'm stuck on how to do this.

Below is an example of what I'm trying to accomplish:

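import org.apache.spark.sql.functions.{col, split}
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`

val singersDF = Seq(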
  ("beatles", "help,hey,jude"),
  ("romeo", "eres,mia"),
  ("elvis", "this,is,an,example")
).toDF("name", "hit_songs")

val actualDF = singersDF.withColumn(
  "hit_songs",
  split(col("hit_songs"), "\\,")
)

actualDF.show(false)
actualDF.printSchema() 

+-------+-----------------------+
|name   |hit_songs              |
+-------+-----------------------+
|beatles|[help, hey, jude]      |
|romeo  |[eres, mia]            |
|elvis  |[this, is, an, example]|
+-------+-----------------------+
root
 |-- name: string (nullable = true)
 |-- hit_songs: array (nullable = true)
 |    |-- element: string (containsNull = true)

The end goal is the output below, where each row keeps only the last string in the hit_songs array.

I'm not worried about what the schema looks like afterwards.

+-------+---------+
|name   |hit_songs|
+-------+---------+
|beatles|jude     |
|romeo  |mia      |
|elvis  |example  |
+-------+---------+

Recommended answer

You can use the size function to compute the index of the last item in the array (size minus 1, because Column.apply indexes arrays from zero), and then pass that index to Column.apply, either explicitly or implicitly:

import org.apache.spark.sql.functions._
import spark.implicits._

actualDF.withColumn("hit_songs", $"hit_songs".apply(size($"hit_songs").minus(1)))

Or, using the implicit form:

actualDF.withColumn("hit_songs", $"hit_songs"(size($"hit_songs").minus(1)))
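As a possible alternative to computing the index by hand, here is a minimal sketch that uses element_at, assuming Spark 2.4 or later (the version that introduced this function). element_at indexes arrays from 1, and a negative index counts from the end, so -1 refers to the last element. The object name LastElementSketch and the helper withLastElement are hypothetical names chosen for illustration; they are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, element_at, split}

// Hypothetical wrapper object, used only for this illustration.
object LastElementSketch {

  // Replaces the given array column with its last element.
  // element_at is 1-based; a negative index counts from the end of the array.
  def withLastElement(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, element_at(col(column), -1))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for a quick experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("last-element-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in the question, split into an array column.
    val actualDF = Seq(
      ("beatles", "help,hey,jude"),
      ("romeo", "eres,mia"),
      ("elvis", "this,is,an,example")
    ).toDF("name", "hit_songs")
      .withColumn("hit_songs", split(col("hit_songs"), ","))

    withLastElement(actualDF, "hit_songs").show(false)

    spark.stop()
  }
}
```

This expresses the intent ("last element") directly instead of through index arithmetic. The size-based approach from the answer above remains the option to use on Spark versions older than 2.4.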
