How does Hive 'alter table <table name> concatenate' work?

Views: 108 Published: 2020/11/22 2:03:50 Tags: hadoop hive hiveql orc


Question

I have a large number (n) of small ORC files that I want to merge into a small number (k) of large ORC files.

In Hive this is done with the 'alter table table_name concatenate' command.

I want to understand how Hive implements this. I'm looking to implement the same thing in Spark, with changes where required.

Any pointers would be great.

Recommended answer

If the table or partition contains many small RCFiles or ORC files, then the above command will merge them into larger files. In case of RCFile the merge happens at block level whereas for ORC files the merge happens at stripe level thereby avoiding the overhead of decompressing and decoding the data.
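Before anything can be merged, something has to decide which small files end up together in which output file. The following is a minimal sketch of that planning step in plain Scala. It is not Hive's implementation: the names SmallFile, MergePlanner and planMerge are hypothetical, the file sizes are assumed to be known already (for example from a directory listing), and the sketch does no I/O.

```scala
// Hypothetical description of one small ORC file: its path and size in bytes.
final case class SmallFile(path: String, bytes: Long)

object MergePlanner {
  // Greedily pack files, in the given order, into groups whose total size
  // stays at or below targetBytes. A file that is larger than targetBytes
  // ends up in a group of its own.
  def planMerge(files: Seq[SmallFile], targetBytes: Long): Vector[Vector[SmallFile]] = {
    require(targetBytes > 0, "targetBytes must be positive")
    val (done, open, _) =
      files.foldLeft((Vector.empty[Vector[SmallFile]], Vector.empty[SmallFile], 0L)) {
        case ((groups, current, size), f) =>
          if (current.nonEmpty && size + f.bytes > targetBytes)
            (groups :+ current, Vector(f), f.bytes) // close the group, start a new one
          else
            (groups, current :+ f, size + f.bytes)  // file still fits in the open group
      }
    if (open.isEmpty) done else done :+ open
  }
}
```

Each resulting group would become one output file, so k is simply the number of groups. With a stripe-level merge, as described above for ORC, the stripes of the files in a group are copied over as they are rather than decoded and re-encoded.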

Also, on ORC stripes:

The body of an ORC file consists of a series of stripes. Stripes are large (typically ~200MB), independent of each other, and often processed by different tasks. The defining characteristic of columnar storage formats is that the data for each column is stored separately, so the cost of reading data out of the file should be proportional to the number of columns read. In ORC files, each column is stored in several streams that sit next to each other in the file. For example, an integer column is represented as two streams: PRESENT, which uses one bit per value to record whether the value is non-null, and DATA, which records the non-null values. If all of a column's values in a stripe are non-null, the PRESENT stream is omitted from the stripe. For binary data, ORC uses three streams: PRESENT, DATA, and LENGTH, which stores the length of each value. The details of each type will be presented in the following subsections.
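The PRESENT/DATA split for an integer column can be illustrated with a small model in plain Scala. This is a simplified sketch, not the ORC library's API: IntColumnStreams, StreamSketch, encode and decode are hypothetical names, and real ORC streams are additionally run-length encoded and compressed, which is left out here.

```scala
// Simplified, uncompressed model of the two streams of an integer column.
// present is None when the PRESENT stream is omitted (no nulls at all).
final case class IntColumnStreams(present: Option[Vector[Boolean]], data: Vector[Int])

object StreamSketch {
  // Split nullable values into a PRESENT bit per value and a DATA stream
  // that holds only the non-null values.
  def encode(values: Seq[Option[Int]]): IntColumnStreams = {
    val presentBits = values.map(_.isDefined).toVector
    val data = values.flatten.toVector
    val present = if (presentBits.forall(identity)) None else Some(presentBits)
    IntColumnStreams(present, data)
  }

  // Rebuild the nullable values by walking the PRESENT bits and pulling the
  // next DATA value for every bit that is set.
  def decode(streams: IntColumnStreams): Vector[Option[Int]] = streams.present match {
    case None => streams.data.map(Some(_))
    case Some(bits) =>
      val it = streams.data.iterator
      bits.map(b => if (b) Some(it.next()) else None)
  }
}
```

Because each stripe carries its own streams, a stripe can be copied into another file as a unit, which is what makes the stripe-level merge possible without decoding the values.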

To implement this in Spark, you can use Spark SQL through a HiveContext created from the SparkContext:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

scala> sqlContext.sql("Your_hive_query_here")
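One caveat about the snippet above: as far as I know, Spark's own SQL parser treats ALTER TABLE ... CONCATENATE as an unsupported Hive command, so passing that statement to sqlContext.sql may be rejected depending on the Spark version. An alternative that stays within Spark is to read the small files and write them back out in fewer partitions. The sketch below assumes Spark 2.x or later (SparkSession instead of HiveContext); the paths, the sizes and the helper outputFileCount are hypothetical. Unlike Hive's stripe-level merge, this approach decodes and re-encodes every row.

```scala
import org.apache.spark.sql.SparkSession

object CompactOrc {
  // Number of output files needed so that each is roughly targetBytes
  // (always at least 1). Hypothetical helper, not part of Spark.
  def outputFileCount(totalBytes: Long, targetBytes: Long): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("compact-orc").getOrCreate()

    val inputDir  = "/data/events_small"      // hypothetical: directory of small ORC files
    val outputDir = "/data/events_compacted"  // hypothetical: must not exist yet

    // Assumed sizes: 10 GiB of input, aiming for files of about 256 MiB.
    val k = outputFileCount(10L * 1024 * 1024 * 1024, 256L * 1024 * 1024)

    // coalesce(k) reduces the data to at most k partitions without a full
    // shuffle; each partition is written as one ORC file.
    spark.read.orc(inputDir).coalesce(k).write.orc(outputDir)

    spark.stop()
  }
}
```

If the resulting files come out too uneven, repartition(k) can be used in place of coalesce(k), at the cost of a full shuffle.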
