Read specific column from Parquet without using Spark


Problem description

I am trying to read Parquet files without using Apache Spark. Reading a file works, but I am finding it hard to read only specific columns. I could not find any good resource on Google, as almost all the posts are simply about reading the Parquet file. Below is my code:

import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader

object parquetToJson {
  // Schema of the file, for reference:
  // case class Customer(key: Int, name: String, sellAmount: Double, profit: Double, state: String)
  def main(args: Array[String]): Unit = {
    val parquetFilePath = new Path("data/parquet/Customer/")
    val reader = AvroParquetReader.builder[GenericRecord](parquetFilePath).build()
    try {
      // read() returns null once there are no more records
      val iter = Iterator.continually(reader.read()).takeWhile(_ != null)
      iter.foreach(record => println(record))
    } finally {
      reader.close() // release the underlying file handle
    }
  }
}

The commented-out case class represents the schema of my file. Right now the above code reads all the columns from the file; I want to read only specific columns.

Recommended answer

If you just want to read specific columns, you need to set a read schema on the configuration that the ParquetReader builder accepts. (This is also known as a projection.)

In your case, you should be able to call .withConf(conf) on the AvroParquetReader builder and, on the conf you pass in, call conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema), where schema is an Avro schema in String form.
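
As a rough illustration of a projection with the Avro reader, here is a minimal sketch in Scala. It makes several assumptions that are not in the original answer. It uses the AvroReadSupport helpers setRequestedProjection and setAvroReadSchema instead of setting ReadSupport.PARQUET_READ_SCHEMA directly, because, as far as I can tell, the Avro binding reads its projection from its own configuration keys, while PARQUET_READ_SCHEMA is meant for a Parquet message-type string. The object name ProjectedRead, the record name "Customer", and the choice of the key and name columns are hypothetical, and the field types are guessed from the commented-out case class; they need to match how the file was actually written.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}

object ProjectedRead {
  // Projection schema listing only the columns to read.
  // Assumption: "key" is an int and "name" is a string, as suggested by
  // the commented-out Customer case class in the question.
  val projection: Schema = SchemaBuilder
    .record("Customer")
    .fields()
    .requiredInt("key")
    .requiredString("name")
    .endRecord()

  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Ask the Parquet layer to materialize only the projected columns...
    AvroReadSupport.setRequestedProjection(conf, projection)
    // ...and build the returned GenericRecords against the same schema.
    AvroReadSupport.setAvroReadSchema(conf, projection)

    val reader = AvroParquetReader
      .builder[GenericRecord](new Path("data/parquet/Customer/"))
      .withConf(conf)
      .build()
    try {
      // read() returns null once there are no more records
      Iterator.continually(reader.read())
        .takeWhile(_ != null)
        .foreach(record => println(record))
    } finally {
      reader.close()
    }
  }
}
```

If the columns in the file are nullable, the projection fields would likely need to be declared as optional (for example with optionalString) rather than required.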
