Generate metadata for Parquet files

Question

I have a Hive table built on top of a large number of external Parquet files. The Parquet files are generated by a Spark job, but because the metadata flag was set to false, the metadata files were not generated. I'm wondering whether it is possible to restore them in some painless way. The file structure is as follows:

/apps/hive/warehouse/test_db.db/test_table/_SUCCESS
/apps/hive/warehouse/test_db.db/test_table/_common_metadata
/apps/hive/warehouse/test_db.db/test_table/_metadata
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-20
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-21
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-22
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-23
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-24
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-25
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-26
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-27
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-28
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-29
/apps/hive/warehouse/test_db.db/test_table/end_date=2016-04-30

Let's assume that the _metadata file is missing or outdated. Is there a way to recreate it with a Hive command, or otherwise generate it, without having to rerun the whole Spark job?

Recommended answer

OK, here is the drill: the metadata can be accessed directly using the classes in org.apache.parquet.hadoop. You first need to read the footers of your Parquet files:

import scala.collection.JavaConverters.{collectionAsScalaIterableConverter, mapAsScalaMapConverter}

import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration

val conf = spark.sparkContext.hadoopConfiguration

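// Reads the footers of all Parquet files under `path` (a single file or a directory).
// Note: readAllFootersInParallel is marked deprecated in recent Parquet releases.
def getFooters(conf: Configuration, path: String) = {
  val fs = FileSystem.get(conf)  // default file system of the Hadoop configuration
  val footers = ParquetFileReader.readAllFootersInParallel(conf, fs.getFileStatus(new Path(path)))
  footers
}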

Now you can get the file metadata as follows:

def getFileMetadata(conf: Configuration, path: String) = {
  getFooters(conf, path)
    .asScala.map(_.getParquetMetadata.getFileMetaData.getKeyValueMetaData.asScala)
}

Now you can get the metadata of your Parquet file:

getFileMetadata(conf, "/tmp/foo").headOption

// Option[scala.collection.mutable.Map[String,String]] =
//   Some(Map(org.apache.spark.sql.parquet.row.metadata ->
//     {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{"foo":"bar"}},
//     {"name":"txt","type":"string","nullable":true,"metadata":{}}]}))
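
The value stored under org.apache.spark.sql.parquet.row.metadata is the Spark schema serialized as JSON, so it can be turned back into a StructType if you need it. A minimal sketch, assuming spark-sql is on the classpath; the object name SchemaFromFooter and the helper sparkSchema are made up for illustration and are not part of Spark or Parquet:

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper object, for illustration only.
object SchemaFromFooter {
  // Key under which the Spark schema appears in the key-value metadata shown above.
  val SparkSchemaKey = "org.apache.spark.sql.parquet.row.metadata"

  // Returns the Spark schema if the key is present and its JSON describes a struct.
  def sparkSchema(keyValues: scala.collection.Map[String, String]): Option[StructType] =
    keyValues.get(SparkSchemaKey)
      .map(json => DataType.fromJson(json))    // parse the JSON schema string
      .collect { case s: StructType => s }     // keep it only if it is a struct
}
```

With the functions above, something like getFileMetadata(conf, "/tmp/foo").headOption.flatMap(SchemaFromFooter.sparkSchema) should give you the schema of the first file, if that file was written by Spark.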

We can also use the extracted footers to write standalone metadata files when needed:

import org.apache.parquet.hadoop.ParquetFileWriter

def createMetadata(conf: Configuration, path: String) = {
  val footers = getFooters(conf, path)
  ParquetFileWriter.writeMetadataFile(conf, new Path(path), footers)
}
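
Applied to the table from the question, the same two calls can be wrapped into a single step that also checks the result. This is only a sketch under some assumptions: the object RegenerateParquetMetadata and its methods are hypothetical names, the file system is resolved from the path itself rather than from the default configuration, and both Parquet calls are marked deprecated in recent Parquet releases, so check the version you run against:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.{ParquetFileReader, ParquetFileWriter}

// Hypothetical wrapper object; names are illustrative only.
object RegenerateParquetMetadata {

  // Summary files that writeMetadataFile is expected to produce under the table root.
  def summaryFiles(tableDir: String): Seq[Path] = Seq(
    new Path(tableDir, ParquetFileWriter.PARQUET_METADATA_FILE),        // _metadata
    new Path(tableDir, ParquetFileWriter.PARQUET_COMMON_METADATA_FILE)  // _common_metadata
  )

  // Reads the footers of the data files under tableDir, rewrites the summary
  // files, and reports whether both of them exist afterwards.
  def regenerate(conf: Configuration, tableDir: String): Boolean = {
    val root = new Path(tableDir)
    val fs = root.getFileSystem(conf)  // file system that owns this path
    val footers = ParquetFileReader.readAllFootersInParallel(conf, fs.getFileStatus(root))
    ParquetFileWriter.writeMetadataFile(conf, root, footers)
    summaryFiles(tableDir).forall(p => fs.exists(p))
  }
}
```

From a spark-shell you would call it as RegenerateParquetMetadata.regenerate(spark.sparkContext.hadoopConfiguration, "/apps/hive/warehouse/test_db.db/test_table"). It only reads footers, so it does not rerun the job that produced the data.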

I hope this answers your question. You can read more about Spark DataFrames and metadata in awesome-spark's spark-gotchas repo.
