When to use HCatalog and what are its benefits


Problem description

I'm new to HCatalog (HCat). We would like to know in which use cases/scenarios we would use HCat, what the benefits of using HCat are, and whether any performance improvement can be gained from HCatalog. Can anyone provide information on when to use HCatalog?

Solution

Apache HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Apache Pig, Apache Map/Reduce, and Apache Hive – to more easily read and write data on the grid.

HCatalog creates a table abstraction layer over data stored on an HDFS cluster. This table abstraction layer presents the data in a familiar relational format and makes it easier to read and write data using familiar query language concepts.

HCatalog data structures are defined using Hive's data definition language (DDL), and the Hive metastore stores them. Using the command-line interface (CLI), users can create, alter, and drop tables. Tables are organized into databases, or are placed in the default database if none is specified. Once tables are created, you can explore their metadata using commands such as SHOW TABLES and DESCRIBE. HCatalog's commands are the same as Hive's DDL commands.
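A minimal Java sketch of that workflow, issuing the DDL through HiveServer2's JDBC driver rather than the CLI; the endpoint, the employees table, and its columns are assumptions, and the hive-jdbc jar must be on the classpath. Because Hive and HCatalog share the metastore, the resulting table is visible to HCatalog-aware tools as well.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HCatalogDdlSketch {
        public static void main(String[] args) throws Exception {
            // Older drivers may need explicit registration.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Assumed HiveServer2 endpoint; adjust for your cluster.
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement()) {
                // The table lands in the Hive metastore, so Pig and
                // Map/Reduce can reach it through HCatalog as well.
                stmt.execute("CREATE TABLE IF NOT EXISTS employees ("
                        + "id INT, name STRING, dept STRING) "
                        + "STORED AS RCFILE");
                // Explore the metadata, as with SHOW TABLES / DESCRIBE.
                try (ResultSet rs = stmt.executeQuery("DESCRIBE employees")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "  " + rs.getString(2));
                    }
                }
            }
        }
    }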

HCatalog ensures that users need not worry about where or in what format their data is stored. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables' metadata.
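As a hedged illustration of that REST access (served by WebHCat, formerly Templeton), the snippet below fetches a table's description over HTTP; the host, the conventional default port 50111, and the user.name and table values are assumptions for your cluster.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WebHCatSketch {
        public static void main(String[] args) throws Exception {
            // Assumed WebHCat server; 50111 is the conventional default port.
            String url = "http://localhost:50111/templeton/v1/"
                    + "ddl/database/default/table/employees?user.name=hive";
            HttpResponse<String> response = HttpClient.newHttpClient().send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // WebHCat answers with a JSON description of the table's
            // columns, location, and storage format.
            System.out.println(response.body());
        }
    }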

HCatalog opens up the Hive metadata to other Map/Reduce tools. Every Map/Reduce tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). Tools that go through HCatalog do not need to care about where the data is stored or in which format. HCatalog offers several benefits:

1. It assists integration with other tools and supplies read and write interfaces for Pig, Hive, and Map/Reduce.
2. It provides a shared schema and data types for Hadoop tools. You do not have to explicitly type the data structures in each program.
3. It exposes the information as a REST interface for external data access.
4. It also integrates with Sqoop, which is a tool designed to transfer data back and forth between Hadoop and relational databases such as SQL Server and Oracle.
5. It provides APIs and web service wrappers for accessing metadata in the Hive metastore (see the sketch after this list).
6. HCatalog also exposes a REST interface so that you can create custom tools and applications to interact with Hadoop data structures.
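As a sketch of point 5, the Java client that wraps the metastore API (org.apache.hive.hcatalog.api.HCatClient) can read a table's schema; the database and table names below are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hive.hcatalog.api.HCatClient;
    import org.apache.hive.hcatalog.api.HCatTable;
    import org.apache.hive.hcatalog.data.schema.HCatFieldSchema;

    public class MetastoreSketch {
        public static void main(String[] args) throws Exception {
            // Reads hive-site.xml from the classpath to locate the metastore.
            HCatClient client = HCatClient.create(new Configuration());
            try {
                // Fetch the schema of an assumed table.
                HCatTable table = client.getTable("default", "employees");
                for (HCatFieldSchema field : table.getCols()) {
                    System.out.println(field.getName() + " : " + field.getTypeString());
                }
            } finally {
                client.close();
            }
        }
    }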

This allows us to use the right tool for the right job. For example, we can load data into Hadoop using HCatalog, perform some ETL on the data using Pig, and then aggregate the data using Hive. After the processing, you could then send the data to your data warehouse housed in SQL Server using Sqoop. You can even automate the process using Oozie.

How it works:

1. Pig: the HCatLoader and HCatStorer interfaces (the Pig example appears further below)
2. Map/Reduce: the HCatInputFormat and HCatOutputFormat interfaces (a sketch follows this list)
3. Hive: no interface necessary; direct access to the metadata
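Here is a minimal, read-only sketch of the Map/Reduce path, loosely modeled on the examples in the HCatalog documentation. The table default.employees, its column order, and the output path taken from args[0] are assumptions for illustration.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hive.hcatalog.data.HCatRecord;
    import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

    public class ReadFromHCatalog {

        // The mapper receives HCatRecord values, so it never has to know
        // whether the table is stored as RCFile, text, or sequence files.
        public static class EmployeeMapper
                extends Mapper<WritableComparable, HCatRecord, Text, NullWritable> {
            @Override
            protected void map(WritableComparable key, HCatRecord value, Context ctx)
                    throws IOException, InterruptedException {
                // Assumes field 1 of the table is the 'name' column.
                ctx.write(new Text(value.get(1).toString()), NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "read-from-hcatalog");
            job.setJarByClass(ReadFromHCatalog.class);
            // Point the job at a metastore table instead of an HDFS path.
            HCatInputFormat.setInput(job, "default", "employees");
            job.setInputFormatClass(HCatInputFormat.class);
            job.setMapperClass(EmployeeMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Writing is roughly symmetric: a job configures HCatOutputFormat with the target table via OutputJobInfo and emits HCatRecord values instead of writing files directly.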

References:

• Microsoft Big Data Solutions
• http://hortonworks.com/hadoop/hcatalog/

Answer to your question:

As I described earlier, HCatalog provides a shared schema and data types for Hadoop tools, which simplifies your work during data processing. If you have created a table using HCatalog, you can directly access that Hive table through Pig or Map/Reduce (without HCatalog, you cannot simply access a Hive table through Pig or Map/Reduce). You don't need to create a schema for every tool.
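For example, a Pig script can load such a table by name with no schema declaration, because HCatLoader pulls the schema from the metastore. Below is a hedged sketch that embeds the script in Java via PigServer; the table names are assumptions, and a standalone Pig Latin script would use the same LOAD/STORE lines.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigOverHCatalog {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // No AS (...) clause: HCatLoader pulls the schema from the
            // Hive metastore by table name.
            pig.registerQuery("emps = LOAD 'default.employees' "
                    + "USING org.apache.hive.hcatalog.pig.HCatLoader();");
            pig.registerQuery("by_dept = GROUP emps BY dept;");
            pig.registerQuery("counts = FOREACH by_dept "
                    + "GENERATE group AS dept, COUNT(emps) AS n;");
            // Store back through HCatalog; the target table
            // default.dept_counts is assumed to already exist.
            pig.store("counts", "default.dept_counts",
                    "org.apache.hive.hcatalog.pig.HCatStorer()");
        }
    }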

If you are working with shared data that is used by multiple users (some teams using Hive, some using Pig, some using Map/Reduce), then HCatalog is useful, since each team only needs the table name to access the data for processing.

It is not a replacement for any tool; it is a facility that provides single access across many tools.

Performance depends on your Hadoop cluster. You should do some performance benchmarking in your Hadoop cluster to measure performance.
