Efficient way to read specific columns from parquet file in spark


Question


What is the most efficient way to read only a subset of columns in Spark from a Parquet file that has many columns? Is spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use a type-safe Dataset with case classes to pre-define my schema, but I am not sure.

Answer

val df = spark.read.parquet("fs://path/file.parquet").select(...)


This will read only the corresponding columns. Indeed, Parquet is a columnar storage format, and it is meant exactly for this type of use case. Try running df.explain and Spark will tell you that only the corresponding columns are read (it prints the execution plan). explain will also tell you which filters are pushed down to the physical execution plan, in case you also use a where condition. Finally, use the following code to convert the DataFrame (a Dataset of rows) to a Dataset of your case class.

import spark.implicits._ // provides the implicit Encoder that .as[MyData] needs
case class MyData(...)   // fields must match the selected column names and types
val ds = df.as[MyData]
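
To also cover the type-safety part of the question, here is a minimal sketch, assuming a hypothetical case class MyData(col1: String, col2: Int) and a hypothetical file path: deriving the read schema from the case class's encoder restricts the scan to exactly those fields and yields a typed Dataset in one step.

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical schema; replace the fields with your actual columns.
case class MyData(col1: String, col2: Int)

val spark = SparkSession.builder().appName("read-subset").getOrCreate()
import spark.implicits._

// Passing the case class's schema to the reader prunes the scan to
// exactly the fields of MyData and returns a typed Dataset directly.
val ds = spark.read
  .schema(Encoders.product[MyData].schema)
  .parquet("fs://path/file.parquet")
  .as[MyData]

// The printed physical plan should show a ReadSchema containing only
// col1 and col2, plus any filters pushed down from the where condition.
ds.where($"col2" > 0).explain()

Either way (select followed by as, or a schema derived up front), explain is the quickest way to confirm that only the requested columns are actually read.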
