Partitions in Hive interview questions


Question

1) If the partitioned column doesn't have data, what error will you get when you query on it?

2) If some rows don't have the partitioned column, how will those rows be handled? Will there be any data loss?

3) Why does bucketing need to be done on a numeric column? Can we use a string column as well? What is the process, and on what basis do you choose the bucketing column?

4) Are internal table details also stored in the metastore, or only external table details?

5) What types of queries run only on the mapper side and not in the reducer, and vice versa?

Recommended answer

Short answers:

1. If the partitioned column doesn't have data, what error will you get when you query on it?

A partition in Hive is a folder named key=value with the data files inside. If there is no data, no partition folders exist and the table is empty: no error is displayed and no data is returned. When you insert NULL into the partition column using dynamic partitioning, all NULL values in the partitioning column (and all values that do not conform to the field type) are loaded into __HIVE_DEFAULT_PARTITION__. If the column type is numeric in this case, a type cast error will be thrown during select, something like "cannot cast textWritable to IntWritable".
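
A minimal HiveQL sketch of the dynamic-partition case described above. The table and column names (staging_sales, sales, country) are made up for illustration and are not from the original answer.

```sql
-- Hypothetical tables: staging_sales is an unpartitioned source,
-- sales is the partitioned target.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE sales (id INT, amount DOUBLE)
PARTITIONED BY (country STRING);

-- With dynamic partitioning the partition column comes last in the SELECT list.
INSERT OVERWRITE TABLE sales PARTITION (country)
SELECT id, amount, country
FROM staging_sales;

-- Each distinct country becomes a folder country=<value>;
-- rows whose country is NULL land under country=__HIVE_DEFAULT_PARTITION__.
SHOW PARTITIONS sales;
```

If staging_sales is empty, the insert creates no partition folders and a later SELECT on sales simply returns no rows.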

2. If some rows don't have the partitioned column, how will those rows be handled? Will there be any data loss?

If "does not have" means NULLs, those rows are loaded into __HIVE_DEFAULT_PARTITION__. The data can still be read, so nothing is lost.

3. Why does bucketing need to be done on a numeric column? It does not need to be numeric. Can we use a string column as well? Yes. What is the process, and on what basis do you choose the bucketing column?

Choose the bucketing columns based on the columns used in joins and filters. Values are hashed, distributed and sorted (clustered), and rows with the same hash are written (during insert overwrite) to the same bucket (file). The number of buckets and the bucketing columns are specified in the table DDL.
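
As a sketch of what such DDL can look like, the table below is bucketed on a string column. The table names, column names and the bucket count of 32 are illustrative assumptions, not values from the original answer.

```sql
-- Hypothetical table; user_id is a STRING to show that the
-- bucketing column does not have to be numeric.
CREATE TABLE user_events (
  user_id    STRING,
  event_time TIMESTAMP,
  payload    STRING
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Assumption: on older Hive releases this setting may be needed
-- so that inserts honour the bucket layout.
-- SET hive.enforce.bucketing=true;

-- user_events_raw is a hypothetical unbucketed source table.
INSERT OVERWRITE TABLE user_events
SELECT user_id, event_time, payload
FROM user_events_raw;
```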

Bucketed tables and bucket map joins are a somewhat outdated concept; you can achieve the same with DISTRIBUTE BY + sort + ORC. This approach is more flexible.
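
One way this alternative might look, assuming a hypothetical source table events_raw and target table events_orc:

```sql
-- Hypothetical plain ORC table with no CLUSTERED BY clause.
CREATE TABLE events_orc (
  user_id    STRING,
  event_time TIMESTAMP,
  payload    STRING
)
STORED AS ORC;

-- DISTRIBUTE BY sends all rows with the same user_id to the same reducer,
-- SORT BY orders the rows inside each reducer's output file.
INSERT OVERWRITE TABLE events_orc
SELECT user_id, event_time, payload
FROM events_raw
DISTRIBUTE BY user_id
SORT BY user_id, event_time;
```

Because the distribution and sort keys live in the query rather than in the table DDL, they can be changed on the next load without redefining the table.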

4. Are internal table details also stored in the metastore, or only external table details?

It does not matter whether the table is external or managed. Table schema, grants and statistics are stored in the metastore.
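
A quick way to see this for yourself is to describe one table of each kind. The table names and the location below are hypothetical.

```sql
-- Hypothetical managed (internal) table.
CREATE TABLE managed_demo (id INT);

-- Hypothetical external table; the path is only an example.
CREATE EXTERNAL TABLE external_demo (id INT)
LOCATION '/tmp/external_demo';

-- Both statements read metadata from the metastore.
DESCRIBE FORMATTED managed_demo;   -- Table Type: MANAGED_TABLE
DESCRIBE FORMATTED external_demo;  -- Table Type: EXTERNAL_TABLE
```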

5. What types of queries run only on the mapper side and not in the reducer, and vice versa?

Queries without aggregations, map joins (when the small table fits in memory), simple column transformations (simple column UDFs like regexp_replace, split, substr, trim, concat, etc.) and filters in WHERE can be executed on the mapper alone. sort by orders rows within each reducer, so it does need a reduce stage, although not the single global ordering that order by implies.

Aggregations and analytic functions, common joins, order by, distribute by and UDAFs are executed on mapper + reducer.
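
To check which case a given query falls into, you can look at its plan with EXPLAIN. The sketch below assumes a hypothetical table user_events(user_id, payload); the exact plan text depends on the Hive version and execution engine, so no output is shown.

```sql
-- Filter plus a simple column UDF: typically planned without a reduce stage
-- (small queries like this may even be served by a plain fetch task).
EXPLAIN
SELECT user_id, trim(payload)
FROM user_events
WHERE user_id IS NOT NULL;

-- Aggregation: the plan normally contains a reduce stage.
EXPLAIN
SELECT user_id, count(*)
FROM user_events
GROUP BY user_id;
```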
