How does Hive 'alter table <table name> concatenate' work?

Question

I have a large number (n) of small ORC files which I want to merge into a small number (k) of large ORC files.

This is done using the alter table table_name concatenate command in Hive.

I want to understand how Hive implements this. I'm looking to implement the same thing using Spark, with changes if required.

Any pointers would be great.

Recommended answer

As per the Hive documentation on ALTER TABLE ... CONCATENATE: if the table or partition contains many small RCFiles or ORC files, then the above command will merge them into larger files. For RCFile the merge happens at block level, whereas for ORC files the merge happens at stripe level, thereby avoiding the overhead of decompressing and decoding the data.

ORC stripes:

The body of ORC files consists of a series of stripes. Stripes are large (typically ~200MB) and independent of each other and are often processed by different tasks. The defining characteristic for columnar storage formats is that the data for each column is stored separately and that the cost of reading data out of the file should be proportional to the number of columns read. In ORC files, each column is stored in several streams that are stored next to each other in the file. For example, an integer column is represented as two streams: PRESENT, which uses one bit per value to record whether the value is non-null, and DATA, which records the non-null values. If all of a column's values in a stripe are non-null, the PRESENT stream is omitted from the stripe. For binary data, ORC uses three streams: PRESENT, DATA, and LENGTH, which stores the length of each value. The details of each type will be presented in the following subsections.
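To make the PRESENT/DATA split concrete, here is a minimal conceptual sketch in plain Scala. The names `IntColumnStreams` and `OrcStreamsSketch` are invented for illustration and are not part of any ORC library, and the sketch ignores the run-length encoding and compression that real ORC applies to these streams.

```scala
// Conceptual model only: real ORC also encodes and compresses each stream.
// present = None models "PRESENT stream omitted because nothing is null".
final case class IntColumnStreams(present: Option[Vector[Boolean]], data: Vector[Long])

object OrcStreamsSketch {
  // Split a nullable integer column into a PRESENT stream and a DATA stream.
  def encode(values: Seq[Option[Long]]): IntColumnStreams = {
    val data = values.flatten.toVector // only the non-null values
    val present =
      if (values.forall(_.isDefined)) None // all non-null: omit PRESENT
      else Some(values.map(_.isDefined).toVector) // one flag per value
    IntColumnStreams(present, data)
  }

  // Rebuild the column by walking PRESENT and pulling from DATA when a flag is set.
  def decode(streams: IntColumnStreams): Vector[Option[Long]] =
    streams.present match {
      case None => streams.data.map(v => Option(v))
      case Some(bits) =>
        val it = streams.data.iterator
        bits.map(isSet => if (isSet) Some(it.next()) else None)
    }
}
```

Because a stripe carries its own streams for every column, a stripe-level merge can copy whole stripes from the small files into a larger one without looking inside the streams, which is why Hive can skip decompression and decoding.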

To implement this in Spark, you can use Spark SQL with the help of the Spark context:

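scala> // sc is the SparkContext provided by spark-shell; HiveContext lets Spark run HiveQL
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

scala> // replace the placeholder with the Hive statement you want to run
scala> sqlContext.sql("Your_hive_query_here")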
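If the goal is simply k larger files and the statement cannot be passed through to Hive (as far as I am aware, Spark 2.x and later parse SQL themselves and reject ALTER TABLE ... CONCATENATE as an unsupported Hive command), a common workaround is to read the small files and rewrite them with fewer partitions. The sketch below assumes Spark 2.x or later with `SparkSession`; the object name `CompactOrc`, the function `compact`, and all paths are hypothetical. Unlike Hive's stripe-level merge, this approach decodes and re-encodes every row, so it is slower but works with plain Spark APIs.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CompactOrc {
  // Read every ORC file under inputPath and rewrite the rows into at most k files.
  // coalesce(k) narrows to k partitions without a full shuffle; each partition
  // becomes one output file, so fewer than k files appear only if the input
  // already had fewer than k partitions.
  def compact(spark: SparkSession, inputPath: String, outputPath: String, k: Int): Unit = {
    spark.read.orc(inputPath)
      .coalesce(k)
      .write
      .mode(SaveMode.ErrorIfExists) // refuse to overwrite an existing directory
      .orc(outputPath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-orc").getOrCreate()
    // Hypothetical locations; point these at the table's actual directory.
    compact(spark, "/warehouse/db/table_name", "/warehouse/db/table_name_compacted", 8)
    spark.stop()
  }
}
```

Writing to a separate directory and swapping it in afterwards avoids reading and overwriting the same path in one job. If the files should come out evenly sized, `repartition(k)` can replace `coalesce(k)` at the cost of a shuffle.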
