Select specific columns in Spark DataFrames from Array of Struct
Question
I have a Spark DataFrame df with the following schema:
root
|-- k: integer (nullable = false)
|-- v: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: double (nullable = false)
| | |-- c: string (nullable = true)
Is it possible to select just a and b in v from df without doing a map? In particular, df is loaded from a Parquet file, and I don't want the values for c to even be loaded/read.
Recommended answer
It depends on exactly what you expect as output, which is not clear from your question. You can do:
df.select($"v.a",$"v.b").show()
However, the result may not be what you want: since v is an array, this yields one array column for a and one array column for b, with one row per original row. What you probably want is to explode the array v and then select from the exploded DataFrame:
// requires: import org.apache.spark.sql.functions.explode and import spark.implicits._
df.select(explode($"v").as("v")).select($"v.a", $"v.b").show()
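To make the difference between the two selects concrete, here is a minimal sketch that builds a small DataFrame with the same shape as the one in the question and runs both variants. The case classes `Elem` and `Rec`, the object name, and the sample values are hypothetical, and a local SparkSession stands in for whatever session you already have.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring the schema in the question.
case class Elem(a: Int, b: Double, c: String)
case class Rec(k: Int, v: Seq[Elem])

object SelectNestedFields {
  // One output row per input row; a and b each come back as an array column.
  def asArrays(df: DataFrame): DataFrame =
    df.select(col("v.a"), col("v.b"))

  // One output row per array element; a and b come back as scalar columns.
  def exploded(df: DataFrame): DataFrame =
    df.select(explode(col("v")).as("v")).select("v.a", "v.b")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("select-nested-fields")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data, only for illustration.
    val df = Seq(
      Rec(1, Seq(Elem(1, 1.0, "x"), Elem(2, 2.0, "y"))),
      Rec(2, Seq(Elem(3, 3.0, "z")))
    ).toDF()

    asArrays(df).printSchema() // a and b are array-typed here
    exploded(df).show()        // one row per element of v

    spark.stop()
  }
}
```

Which variant is appropriate depends on whether you need to keep the grouping of elements per original row (arrays) or want a flat table (explode).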
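The explode version flattens v into a table with one row per array element. In either case, Spark and Parquet should be able to use column pruning (rather than predicate pushdown, which concerns filters) and not load c at all, although whether fields nested inside an array of structs are actually pruned depends on the Spark version and its settings.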
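One way to check whether c is really skipped is to look at the physical plan of a query over a Parquet source. The sketch below is only an illustration under assumptions: it writes made-up data to a temporary directory, the names `Item`, `Entry` and `CheckNestedPruning` are hypothetical, and it explicitly sets `spark.sql.optimizer.nestedSchemaPruning.enabled`, which as far as I know exists since Spark 2.4 and is enabled by default from Spark 3.0.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical case classes mirroring the schema in the question.
case class Item(a: Int, b: Double, c: String)
case class Entry(k: Int, v: Seq[Item])

object CheckNestedPruning {
  // Reads the Parquet data at `path` and selects only v.a and v.b.
  def selectAB(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path).select(col("v.a"), col("v.b"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("check-nested-pruning")
      // Assumption: nested schema pruning is controlled by this setting.
      .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    // Write made-up sample data to a fresh subdirectory of a temp directory.
    val path = Files.createTempDirectory("nested-parquet").resolve("data").toString
    Seq(Entry(1, Seq(Item(1, 1.0, "x"), Item(2, 2.0, "y")))).toDF().write.parquet(path)

    // Inspect the ReadSchema of the Parquet scan in the printed plan:
    // if pruning applied, c should not be listed there.
    selectAB(spark, path).explain()

    spark.stop()
  }
}
```

If c still shows up in the ReadSchema on your version, the whole struct is being read and the unwanted field is only dropped afterwards.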