Select specific columns in Spark DataFrames from an Array of Struct

Views: 26 | Posted: 2021/11/14 23:00:40 | Tags: apache-spark, spark-dataframe, parquet


Question

I have a Spark DataFrame df with the following schema:

root
 |-- k: integer (nullable = false)
 |-- v: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: integer (nullable = false)
 |    |    |-- b: double (nullable = false)
 |    |    |-- c: string (nullable = true)

Is it possible to select just a and b in v from df without doing a map? In particular, df is loaded from a Parquet file, and I don't want the values for c to even be loaded or read.

Recommended answer

It depends on exactly what you expect as output, which is not clear from your question. Let me clarify. You can do

df.select($"v.a",$"v.b").show()

However, the result may not be what you want. Since v is an array, this yields one array column for a and one array column for b, with one row per original row. What you probably want is to explode the array v and then select from the exploded DataFrame:

import org.apache.spark.sql.functions.explode  // explode is not in scope by default
df.select(explode($"v").as("v")).select($"v.a", $"v.b").show()
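To make the difference between the two forms concrete, here is a minimal sketch on a tiny in-memory DataFrame. It is written script-style, as you might paste into spark-shell or run as a Scala script with Spark on the classpath. The case classes Elem and Rec, the sample values, and the local session settings are assumptions for illustration. Only the column names k, v, a, b and c come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical case classes mirroring the schema in the question.
case class Elem(a: Int, b: Double, c: String)
case class Rec(k: Int, v: Seq[Elem])

// Local session for experimentation; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("array-of-struct-demo")
  .getOrCreate()
import spark.implicits._

// Made-up sample data: two rows holding three array elements in total.
val df = Seq(
  Rec(1, Seq(Elem(10, 1.0, "x"), Elem(11, 1.5, "y"))),
  Rec(2, Seq(Elem(20, 2.0, "z")))
).toDF()

// Form 1: still one row per input row; a and b are each an array column.
val arrays = df.select($"v.a", $"v.b")
arrays.printSchema()

// Form 2: one row per array element; a and b are scalar columns.
val flat = df.select(explode($"v").as("v")).select($"v.a", $"v.b")
flat.printSchema()
flat.show()
```

With this sample data, arrays keeps the two original rows while flat has three rows, one per element of v. Which shape is right depends on whether downstream code needs the grouping by k preserved.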

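This flattens v into a table with one row per array element. In either case, Spark and Parquet should be able to skip c entirely. Note that the mechanism at work is column pruning (projection pushdown) rather than predicate pushdown, which concerns filters. Whether a field nested inside an array of structs is actually pruned depends on the Spark version, so it is worth checking the ReadSchema entry in the output of explain().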
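One way to check whether c is really skipped is to write a small Parquet file and inspect the physical plan. The following is a sketch under assumptions: the case classes, the temporary path, and the sample row are invented for illustration, and the exact plan text varies between Spark versions. In versions that have it, the setting spark.sql.optimizer.nestedSchemaPruning.enabled controls pruning of nested fields.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Hypothetical case classes mirroring the schema in the question.
case class Elem(a: Int, b: Double, c: String)
case class Rec(k: Int, v: Seq[Elem])

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("pruning-check")
  .getOrCreate()
import spark.implicits._

// Scratch location, assumed only for this experiment.
val path = Files.createTempDirectory("pruning-check").resolve("data.parquet").toString

// Write one made-up row so there is a Parquet file to read back.
Seq(Rec(1, Seq(Elem(10, 1.0, "x")))).toDF().write.parquet(path)

val fromParquet = spark.read.parquet(path)

// Print the physical plans. In the Parquet scan node, look at ReadSchema:
// if c does not appear inside v there, it is not being read from the file.
fromParquet.select($"v.a", $"v.b").explain()
fromParquet.select(explode($"v").as("v")).select($"v.a", $"v.b").explain()
```

If ReadSchema still lists c for one of the two queries, that query reads the full struct on your version, and the other form may be the better choice for large files.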
