What is the difference between 'InputFormat, OutputFormat' and 'Stored as' in Hive?

Question

I'm new to big data and am currently learning Hive. I understand the concept of InputFormat and OutputFormat in Hive as part of the SerDe. I also understand that 'Stored as' is used to store a file in a particular format, just like InputFormat. But I don't understand what the significant difference is between using 'InputFormat, OutputFormat' and 'Stored as'.

Any help is appreciated.

Recommended answer

Hive has many options for how to store data. You can use external storage, where Hive just wraps data that lives somewhere else, or you can create a standalone table from scratch in the Hive warehouse. Input and output formats let you specify, for both kinds of table, the original data structure, that is, how the data is physically stored. On the client side you keep working with the table through SQL, but at the low level it may be a text file, a sequence file, an HBase table, or some other data structure.
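To make the two kinds of table concrete, here is a minimal HiveQL sketch. The table names, columns, and the LOCATION path are made up for illustration; only the keywords are Hive's own.

```sql
-- Standalone (managed) table: Hive creates and owns the files
-- under its warehouse directory.
CREATE TABLE page_views_managed (
  user_id BIGINT,
  url     STRING
)
STORED AS ORC;

-- External table: Hive only wraps files that already exist elsewhere.
-- The path below is a hypothetical placeholder.
CREATE EXTERNAL TABLE page_views_raw (
  user_id BIGINT,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/raw/page_views';
```

Both tables are queried with the same SQL. One practical difference is that dropping the external table removes only the metadata and leaves the files in place.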

InputFormat and OutputFormat - let you describe the original data structure so that Hive can properly map it to the table view.

SerDe - the class that performs the actual translation of data from the table view to the low-level input/output format structures and back.

Generally the process looks like this: HDFS files --> InputFileFormat --> Deserializer --> Row object --> Serializer --> OutputFileFormat --> HDFS files
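One way to see this whole chain in a single statement is to copy rows between two tables that use different formats. This is only a sketch with hypothetical table names, and it assumes the default SerDe that Hive picks for each format.

```sql
-- Source table: plain text files, one tab-separated row per line.
CREATE TABLE events_text (
  event_id BIGINT,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Target table: same columns, stored as binary sequence files.
CREATE TABLE events_seq (
  event_id BIGINT,
  payload  STRING
)
STORED AS SEQUENCEFILE;

-- Read path:  text files -> TextInputFormat -> deserializer -> row objects
-- Write path: row objects -> serializer -> sequence-file OutputFormat -> files
INSERT INTO TABLE events_seq
SELECT event_id, payload
FROM events_text;
```

The SELECT never mentions a file format. The read and write classes are taken from each table's definition.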

Stored as - specifies the storage format, which includes the input and output formats, for your new tables in Hive.
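In DDL terms, a format keyword after STORED AS is shorthand for naming the classes yourself. The two statements below should produce equivalent ORC tables. The table names are illustrative, and the class names are the ones Hive ships for ORC.

```sql
-- Short form: the keyword picks the SerDe, InputFormat and OutputFormat.
CREATE TABLE orders_short (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC;

-- Long form: the same three classes spelled out by hand.
CREATE TABLE orders_long (
  order_id BIGINT,
  amount   DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
  INPUTFORMAT  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
```

The long form is mainly useful when no keyword exists for the format you need, for example a custom InputFormat class of your own.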

These attributes can significantly affect performance, overall data size, and schema evolution support, and can enable features such as ACID. You can follow the steps described in this article to see how things work at the low level and to get some general information about the most commonly used formats - https://oyermolenko.blog/2017/02/16/structuring-hadoop-data-through-hive-and-sql
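As an example of a feature that depends on the storage format, here is a sketch of a transactional table. It assumes the cluster already has Hive transactions enabled. Hive versions before 3.0 additionally require the table to be bucketed. The table name and columns are hypothetical.

```sql
-- Full ACID tables require ORC storage.
CREATE TABLE accounts_acid (
  account_id BIGINT,
  balance    DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Row-level changes, which a plain non-transactional table would reject.
UPDATE accounts_acid SET balance = balance - 10 WHERE account_id = 1;
DELETE FROM accounts_acid WHERE account_id = 2;

-- Shows the SerDe library, InputFormat and OutputFormat behind the table.
DESCRIBE FORMATTED accounts_acid;
```

Running DESCRIBE FORMATTED on any table is a quick way to check which classes a STORED AS keyword resolved to.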
