How do Hive data and the metastore communicate and integrate with each other?


Question

I am new to Hive/Hadoop. I have read the documentation and watched videos on how Hive, HDFS, and Hadoop work internally, but there are still a few things I do not understand. As we know, Hive data is stored as files in HDFS, and the table structure (schema) is stored in the metastore.

  1. Since Hive is schema-on-read, the data and the schema are combined only at query execution time to produce the result. Is my understanding of this correct?

  2. Statement 1 talks about integration, but how does that integration actually happen? The files (the actual data) stored in HDFS carry no schema, right? How does MapReduce/Hadoop/Hive know that a particular piece of data in a file belongs to a particular column of the table? Couldn't there be a data mismatch?

I imagine a Hive data file would look like this:

students.txt
-------------
1 abc m@gmail.com
-------------------
2 xyz@ymail.com
---------------

The file above does not store a schema, and for the student with s_id 2 the name is missing. How are such cases handled when the query is executed? I don't think xyz@ymail.com would end up under the student_name field, but I would still like to know how the integration happens.

Recommended answer

Your understanding that "Hive data is stored as files in HDFS and the table structure (schema) is stored in the metastore" is correct. In addition to the schema, the metastore also records the HDFS directory where the table data is stored. Queries use this HDFS path at execution time.

Your statements and my answers:

  1. Since Hive is schema-on-read, the data and the schema are combined only at query execution time to produce the result. Is my understanding of this correct?

Answer: Correct.

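  2. Statement 1 talks about integration, but how does that integration actually happen? The files (the actual data) stored in HDFS carry no schema, right? How does MapReduce/Hadoop/Hive know that a particular piece of data in a file belongs to a particular column of the table? Couldn't there be a data mismatch?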

Answer:

Files stored on HDFS as part of a table, such as text files, contain no structure or column names, only the data. When the table is created, however, we have to state explicitly which columns it has and how they are laid out in the text files. For example, a table with two columns over comma-delimited data would be created with a statement like this:

CREATE TABLE default.column_test
(name string,
email string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

The data file, that is, the text file in the table's HDFS path, should therefore contain data in the following format:

alpha,alpha@email.com
beta,beta@email.com

When this table is queried with a SELECT statement, the query is compiled first and then executed against the data in the HDFS path that is looked up from the Hive metastore.

SELECT * FROM column_test;

    column_test.name    column_test.email

1   alpha               alpha@email.com
2   beta                beta@email.com
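
The flow described above (look up the table's columns, delimiter, and data location, then apply them to the raw files at read time) can be imitated in a few lines of Python. This is only a sketch of the idea, not how Hive is implemented: the `make_metastore` dictionary, the `select_all` function, and the temporary directory standing in for the HDFS path are all hypothetical names introduced for illustration.

```python
import tempfile
from pathlib import Path


def make_metastore(location):
    """Toy stand-in for the metastore: table name -> schema, delimiter, location."""
    return {
        "default.column_test": {
            "columns": ["name", "email"],
            "delimiter": ",",
            "location": location,  # plays the role of the table's HDFS directory
        }
    }


def select_all(metastore, table):
    """Imitate SELECT *: fetch the metadata, then read every file under the location."""
    meta = metastore[table]
    rows = []
    for path in sorted(Path(meta["location"]).iterdir()):
        with path.open() as f:
            for line in f:
                fields = line.rstrip("\n").split(meta["delimiter"])
                # The file has no column names; they come from the metadata.
                rows.append(dict(zip(meta["columns"], fields)))
    return rows


def demo():
    """Write the sample data file into a temporary 'warehouse' and query it."""
    with tempfile.TemporaryDirectory() as warehouse:
        Path(warehouse, "part-0.txt").write_text(
            "alpha,alpha@email.com\nbeta,beta@email.com\n"
        )
        return select_all(make_metastore(warehouse), "default.column_test")


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The point of the sketch is that the data file never changes: the same bytes would yield different columns if the metadata declared different column names or a different delimiter.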

If a row in the file has no value for the name column, as in the first row below,

alpha@email.com
beta,beta@email.com

then the SELECT query treats 'alpha@email.com' as the value of the "name" column and returns NULL for the "email" column of the first record. Fields are assigned to columns purely by position, so Hive cannot tell that the value was meant to be an email address. The output looks like this:

SELECT * FROM column_test;

    column_test.name    column_test.email

1   alpha@email.com     NULL
2   beta                beta@email.com
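
The positional mapping behind this output can be sketched in Python as follows. The function name `row_to_columns` is hypothetical, and `None` stands in for Hive's NULL. Dropping surplus fields is an assumption about how the default delimited text format treats rows that are too long; the answer above only covers rows that are too short.

```python
def row_to_columns(line, columns, delimiter=","):
    """Map the fields of one text row onto the declared columns by position."""
    fields = line.split(delimiter)
    # Keep at most len(columns) fields (assumption: extras are ignored),
    # then pad short rows with None, which plays the role of NULL.
    padded = fields[: len(columns)] + [None] * (len(columns) - len(fields))
    return dict(zip(columns, padded))


if __name__ == "__main__":
    columns = ["name", "email"]
    for line in ["alpha@email.com", "beta,beta@email.com"]:
        print(row_to_columns(line, columns))
```

Nothing in this logic inspects the content of a field, which is why a lone email address lands in the first column rather than the second.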

Hope this helps!
