Hive查询语言中的主键和索引是否可行? [英] Primary keys and indexes in Hive query language is poosible or not?

查看:164
本文介绍了Hive查询语言中的主键和索引是否可行?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们正在尝试将oracle表迁移到配置单元并对其进行处理.当前,oracle中的表具有主键外键唯一键约束.

We are trying to migrate oracle tables to hive and process them. Currently the tables in oracle has primary key,foreign key and unique key constraints.

我们可以在蜂巢中复制相同的东西吗?

Can we replicate the same in hive?

我们正在对实现方法进行一些分析.

We are doing some analysis on how to implement it.

推荐答案

配置单元Hive 0.7.0( HIVE-417 )并在Hive 3.0中删除( HIVE-18448 ),请阅读此注释吉拉该功能在Hive中完全没有用.这些索引对于大数据来说太昂贵了,RIP.

Hive indexing was introduced in Hive 0.7.0 (HIVE-417) and removed in Hive 3.0 (HIVE-18448) Please read comments in this Jira. The feature was completely useless in Hive. These indexes was too expensive for big data, RIP.

自Hive 2.1.0起( HIVE-13290 )配置单元包括对未验证的主键和外键约束的支持.这些约束没有得到验证,上游系统需要确保在将数据加载到Hive中之前的数据完整性.这些约束对于生成ER图和查询的工具很有用.同样,这种未经验证的约束也可以用作自我记录.如果表有这样的约束,您可以轻松找出应该是什么PK.

As of Hive 2.1.0 (HIVE-13290) Hive includes support for non-validated primary and foreign key constraints. These constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive. These constraints are useful for tools generating ER diagrams and queries. Also such non-validated constraints are useful as self-documenting. You can easily find out what is supposed to be a PK if the table has such constraint.

在Oracle数据库Unique中,PK和FK约束由索引支持,因此它们可以快速工作并且非常有用.但这不是Hive的工作方式以及它的设计目的.

In Oracle database Unique, PK and FK constraints are backed with indexes, so they can work fast and are really useful. But this is not how Hive works and what it was designed for.

在正常情况下是在HDFS中使用半结构化数据加载非常大的文件时.在其上建立索引太昂贵,并且没有索引来检查PK违例仅可能扫描所有数据.通常,您不能在BigData中实施约束.上游过程可以照顾到数据的完整性和一致性,但这不能保证您最终不会在Hive中从其他来源加载的某个大表中违反PK.

Quite normal scenario is when you loaded very big file with semi-structured data in HDFS. Building an index on it is too expensive and without index to check PK violation is possible only to scan all the data. And normally you cannot enforce constraints in BigData. Upstream process can take care about data integrity and consistency but this does not guarantee you finally will not have PK violation in Hive in some big table loaded from different sources.

某些文件存储格式,例如 ORC 具有内部轻量级索引"以加快速度-up过滤和启用谓词下推(PPD),使用此类索引不会实现PK和FK约束.之所以无法这样做,是因为通常您可以在Hive中拥有许多属于同一表的文件,甚至文件可以具有不同的架构.为PB创建的Hive,您可以单次运行PB,数据可以是半结构化的,文件可以具有不同的架构.Hadoop不支持随机写入,如果您要重建索引,这会增加更多的复杂性和成本.

Some file storage formats like ORC have internal light weight "indexes" to speed-up filtering and enabling predicate push down (PPD), no PK and FK constraints are implemented using such indexes. This cannot be done because normally you can have many such files belonging to the same table in Hive and files even can have different schemas. Hive created for petabytes and you can process petabytes in single run, data can be semi-structured, files can have different schemas. Hadoop does not support random writes and this adds more complications and cost if you want to rebuild indexes.

这篇关于Hive查询语言中的主键和索引是否可行?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆