Creating hive table using parquet file metadata


Problem description

I wrote a DataFrame out as a Parquet file, and I would like to read that file with Hive, using the metadata from Parquet.
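
For context, here is a minimal sketch of how such a DataFrame might have been written (my assumption, not part of the question; the column names TSN, TS and Etype are taken from the schema embedded in _common_metadata further down, and the sample rows are invented):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical reconstruction of the write that produced the files listed below.
val sc = new SparkContext(new SparkConf().setAppName("write-parquet").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq(
  ("tsn-001", "2015-10-01 00:00:00", "A"),
  ("tsn-002", "2015-10-01 00:05:00", "B")
).toDF("TSN", "TS", "Etype")

// Spark 1.x compresses Parquet with gzip by default, which matches the
// *.gz.parquet part files in the listing below.
df.write.parquet("/home/gz_files/result")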

Output from the Parquet write:

_common_metadata  part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  _SUCCESS
_metadata         part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet  part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet

Hive table:

CREATE  TABLE testhive
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '/home/gz_files/result';



FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified

How can I infer the metadata from the Parquet file?

If I open _common_metadata, it has the following content:

PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata▒{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}

Or, how can I parse the metadata file?

Recommended answer

I had the same question. It might be hard to implement in practice, though, as Parquet supports schema evolution:

http://www.cloudera.com/content/www/en-us/documentation/archive/impala/2-x/2-0-x/topics/impala_parquet.html#parquet_schema_evolution_unique_1

For example, you could add a new column to your table without touching the data that is already in it; only the new data files will carry the new metadata (which remains compatible with the previous version).
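
To illustrate, here is a sketch adapted from the schema-merging example in the Spark SQL programming guide linked below (the /home/gz_files/evolving path, the batch=… directories and the sample rows are invented):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext
import sqlContext.implicits._

// First batch written with two columns.
Seq(("tsn-001", "2015-10-01")).toDF("TSN", "TS")
  .write.parquet("/home/gz_files/evolving/batch=1")

// A later batch adds Etype; the files from the first batch are not touched.
Seq(("tsn-002", "2015-10-02", "A")).toDF("TSN", "TS", "Etype")
  .write.parquet("/home/gz_files/evolving/batch=2")

// mergeSchema reconciles the per-file schemas into one superset schema.
val merged = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/home/gz_files/evolving")

merged.printSchema()
// root
//  |-- TSN: string (nullable = true)
//  |-- TS: string (nullable = true)
//  |-- Etype: string (nullable = true)
//  |-- batch: integer (nullable = true)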

Schema merging has been switched off by default since Spark 1.5.0, because it is a "relatively expensive operation" (http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging), so inferring the most recent schema may not be as simple as it sounds. Quick-and-dirty approaches are still quite possible, though, e.g. by parsing the output of:

$ parquet-tools schema /home/gz_files/result/000000_0
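
As an alternative to scraping the parquet-tools output, here is a quick-and-dirty sketch of the same idea using Spark itself (my own variant, not from the original answer): read the directory once, take the schema Spark infers from the Parquet footers, and generate the column list that the failing CREATE TABLE above was missing. The naive type mapping is enough for this file, where every column is a string; extend it for other types.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

// Spark reads the Parquet footers/summary files and hands back the schema.
val schema = sqlContext.read.parquet("/home/gz_files/result").schema

// Minimal Spark-to-Hive type mapping; extend as needed.
def hiveType(dt: DataType): String = dt match {
  case StringType  => "STRING"
  case IntegerType => "INT"
  case LongType    => "BIGINT"
  case DoubleType  => "DOUBLE"
  case other       => sys.error(s"add a mapping for $other")
}

val columns = schema.fields
  .map(f => s"  `${f.name}` ${hiveType(f.dataType)}")
  .mkString(",\n")

// EXTERNAL so that dropping the table does not delete the data files;
// STORED AS PARQUET is shorthand for the SerDe/input/output formats used above.
val ddl =
  s"""CREATE EXTERNAL TABLE testhive (
     |$columns
     |)
     |STORED AS PARQUET
     |LOCATION '/home/gz_files/result'""".stripMargin

println(ddl)

With the schema from _common_metadata this prints a statement with TSN, TS and Etype as STRING columns, which satisfies the "list of columns" requirement from the SemanticException.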
