Scala/Spark - How to get first elements of all sub-arrays


Question

I have the following DataFrame in Spark (I'm using Scala):

[[1003014, 0.95266926], [15, 0.9484202], [754, 0.94236785], [1029530, 0.880922], [3066, 0.7085166], [1066440, 0.69400793], [1045811, 0.663178], [1020059, 0.6274495], [1233982, 0.6112905], [1007801, 0.60937023], [1239278, 0.60044676], [1000088, 0.5789191], [1056268, 0.5747936], [1307569, 0.5676605], [10334513, 0.56592846], [930, 0.5446228], [1170206, 0.52525467], [300, 0.52473146], [2105178, 0.4972785], [1088572, 0.4815367]]

I want to get a DataFrame with only the first Int of each sub-array, something like:

[1003014, 15, 754, 1029530, 3066, 1066440, ...]

That is, keeping only the x[0] of each sub-array x of the array listed above.

I'm new to Scala and couldn't find the right anonymous map function. Thanks in advance for any help.
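For reference, the per-element logic being asked for is the same as mapping `_.head` (or `_(0)`) over a plain Scala nested collection. A minimal sketch outside Spark, using the first few pairs from the question as sample data:

```scala
// Plain-Scala sketch of the desired transformation (no Spark involved).
// Each inner Seq pairs an ID with a score; we keep only the ID.
val arrays = Seq(
  Seq(1003014.0, 0.95266926),
  Seq(15.0, 0.9484202),
  Seq(754.0, 0.94236785),
  Seq(1029530.0, 0.880922)
)

// Take the first element of every sub-array, then cast back to Int.
val firstElements: Seq[Int] = arrays.map(_.head.toInt)

println(firstElements)  // List(1003014, 15, 754, 1029530)
```

The Spark answers below apply this same `x -> x[0]` idea at the column level.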

Answer

For Spark >= 2.4, you can use the higher-order function transform with a lambda function to extract the first element of each value array.

scala> df.show(false)

+----------------------------------------------------------------------------------------+
|arrays                                                                                  |
+----------------------------------------------------------------------------------------+
|[[1003014.0, 0.95266926], [15.0, 0.9484202], [754.0, 0.94236785], [1029530.0, 0.880922]]|
+----------------------------------------------------------------------------------------+

scala> df.select(expr("transform(arrays, x -> x[0])").alias("first_array_elements")).show(false)

+-----------------------------------+
|first_array_elements               |
+-----------------------------------+
|[1003014.0, 15.0, 754.0, 1029530.0]|
+-----------------------------------+

Spark < 2.4

Explode the initial array, then aggregate with collect_list to collect the first element of each sub-array:

import org.apache.spark.sql.functions.{col, collect_list, explode}

df.withColumn("exploded_array", explode(col("arrays")))
  .agg(collect_list(col("exploded_array")(0)))
  .show(false)

In case the array contains structs and not sub-arrays, just change the access method, using dots for struct fields:

val transform_expr = "transform(arrays, x -> x.canonical_id)"
df.select(expr(transform_expr).alias("first_array_elements")).show(false)
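The struct case follows the same shape in plain Scala: accessing a named field instead of a positional index. A minimal sketch with a hypothetical case class standing in for the struct column:

```scala
// Hypothetical case class mirroring a struct with a canonical_id field
// (the field name is taken from the answer's transform expression).
case class Entry(canonical_id: Int, score: Double)

val entries = Seq(Entry(1003014, 0.95266926), Entry(15, 0.9484202))

// Dot access on the field, just as x.canonical_id does inside transform().
val ids: Seq[Int] = entries.map(_.canonical_id)

println(ids)  // List(1003014, 15)
```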

