How does Hive store data (loaded from HDFS)?


Problem Description

I am fairly new to Hadoop (HDFS and HBase) and the Hadoop ecosystem (Hive, Pig, Impala, etc.). I have a good understanding of Hadoop components such as the NameNode, DataNodes, JobTracker and TaskTrackers, and of how they work together to store data efficiently.



While trying to understand the fundamentals of a data access layer such as Hive, I need to understand where exactly a table's data (created in Hive) gets stored. We can create external and internal (managed) tables in Hive. Since external tables can live in HDFS or any other file system, Hive does not store the data for such tables in its warehouse. What about internal tables? Such a table will be created as a directory on one of the data nodes in the Hadoop cluster. Once we load data into these tables from the local or HDFS file system, are further files created to store the data in the tables created in Hive?



Say, for example:


  1. A sample file named test_emp_feedback.csv was copied from the local file system to HDFS.

  2. A table (emp_feedback) was created in Hive with a structure matching the CSV file. This led to the creation of a directory in the Hadoop cluster, say /users/big_data/hive/emp_feedback.

  3. Now I create the table and load data into the emp_feedback table from test_emp_feedback.csv.

Is Hive going to create a copy of the file in the emp_feedback directory? Won't that cause data redundancy?
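
The scenario above corresponds roughly to the HiveQL below. This is a minimal sketch; the column names and the HDFS path of the uploaded CSV are assumptions made for illustration.

    -- Managed (internal) table whose layout matches the CSV; the columns are assumed.
    CREATE TABLE emp_feedback (
      emp_id    INT,
      emp_name  STRING,
      feedback  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- The CSV was already copied from the local file system into HDFS,
    -- e.g. with: hdfs dfs -put test_emp_feedback.csv /users/big_data/
    -- (the source directory here is an assumption).
    LOAD DATA INPATH '/users/big_data/test_emp_feedback.csv' INTO TABLE emp_feedback;

Whether that LOAD DATA statement copies or moves the file is exactly what the answer below addresses.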

Solution

Creating a managed table creates a directory with the same name as the table under the Hive warehouse directory (usually /user/hive/warehouse/dbname/tablename). The table structure (Hive metadata) is also created in the metastore (RDBMS/HCat).



Until you load data into the table, this directory (with the same name as the table, under the Hive warehouse) is empty.
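
A quick way to check this from the Hive CLI, assuming the table from the question lives in the default database (so its directory sits directly under the warehouse root):

    -- Shows the table's Location, owner, and other metadata kept in the metastore.
    DESCRIBE FORMATTED emp_feedback;

    -- List the table directory without leaving Hive; before any load it is empty.
    dfs -ls /user/hive/warehouse/emp_feedback;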



There are two possible scenarios.


  1. If the table is external, the data is not copied to the warehouse directory at all.

  2. If the table is managed (not external), then when you load data into the table it is moved (not copied) from its current HDFS location to the Hive warehouse directory (/user/hive/warehouse/<dbname>/<tablename>). So this does not replicate the data.

  Caution: It is always advisable to create an external table unless the data is used only by Hive. Dropping a managed table deletes the data from HDFS (the Hive warehouse), whereas dropping an external table removes only the metadata; see the sketch below.



    HadoopGig
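
To make the caution concrete, here is a hedged sketch of the external-table alternative; the column names and the LOCATION path are assumptions, not values from the original question.

    -- External table over data that stays at its original HDFS location.
    CREATE EXTERNAL TABLE emp_feedback_ext (
      emp_id    INT,
      emp_name  STRING,
      feedback  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/users/big_data/feedback_data';

    -- Dropping an external table removes only the metastore entry;
    -- the files under /users/big_data/feedback_data are left untouched.
    DROP TABLE emp_feedback_ext;

    -- By contrast, DROP TABLE emp_feedback (the managed table) would also
    -- delete its directory under the Hive warehouse.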


