SPARK-HIVE: key differences between Hive and Parquet from the perspective of table schema processing

Problem description

I am new to Spark and Hive. I do not understand the statement:

"Hive认为所有列都可为空,而P​​arquet中的可为空性很重要"

"Hive considers all columns nullable, while nullability in Parquet is significant"

If anyone could explain the statement with an example, that would help me. Thank you.

Recommended answer

In standard SQL syntax, when you create a table, you can state that a specific column is "nullable" (i.e. may contain a Null value) or not (i.e. trying to insert/update a Null value will throw an error).
Nullable is the default.
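
As a minimal PySpark sketch of that idea (column names here are made up): the schema records, per column, whether NULLs are allowed, and nullable is what you get unless you say otherwise.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()

# "id" is declared NOT NULL; "comment" keeps the nullable default
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("comment", StringType(), nullable=True),
])

df = spark.createDataFrame([(1, "ok"), (2, None)], schema)
df.printSchema()
# root
#  |-- id: integer (nullable = false)
#  |-- comment: string (nullable = true)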

Parquet schema syntax supports the same concept, although when using AVRO serialization, not-nullable is the default.
Caveat -- when you use Spark to read multiple Parquet files, these files may have different schemas. Imagine that the schema definition has changed over time, and newer files have 2 more Nullable columns at the end. Then you have to request "schema merging" so that Spark reads the schema from all files (not just one at random) to make sure that all these schemas are compatible, then at read-time the "undefined" columns are defaulted to Null for older files.
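
A rough sketch of requesting that "schema merging" from PySpark (the path is hypothetical; mergeSchema is the standard Parquet reader option):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ask Spark to reconcile the footers of ALL Parquet files under the path,
# not just the schema of one file picked more or less at random.
df = (spark.read
          .option("mergeSchema", "true")
          .parquet("/data/events/"))   # hypothetical path

df.printSchema()   # the 2 newer nullable columns appear; older files yield NULL for them

# the same behaviour can also be switched on globally:
# spark.conf.set("spark.sql.parquet.mergeSchema", "true")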

Hive HQL syntax does not support the standard SQL feature; every column is, and must be, nullable -- simply because Hive does not have total control on its data files!
Imagine a Hive partitioned table with 2 partitions...

  • one partition uses TextFile format and contains CSV dumps from different sources, some showing all expected columns, some missing the last 2 columns because they use an older definition
  • the second partition uses Parquet format for history, created by Hive INSERT-SELECT queries, but the older Parquet files are also missing the last 2 columns, because they were created with the older table definition
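
Coming back to the claim that every Hive column is nullable: a small hedged sketch of how that shows up from Spark (the table name is hypothetical, and the session must be built with Hive support).

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hive considers all columns nullable, so whatever the underlying Parquet
# footers say, the table schema reported through the metastore is all-nullable.
spark.table("mydb.partitioned_events").printSchema()
# every column prints as (nullable = true)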

For the Parquet-based partition, Hive does "schema merging", but instead of merging the file schemas together (like Spark), it merges each file schema with the table schema -- ignoring columns that are not defined in the table, and defaulting to Null all table columns that are not in the file.
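
Hedged sketch of the visible effect, with hypothetical table and column names; whether you query through Hive or through a Hive-enabled Spark session, rows coming from the older Parquet files simply show NULL for the two newer columns.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Rows written before the table gained its 2 extra columns come back with
# NULL in those columns, because each file schema is merged against the
# table schema at read time.
spark.sql("""
    SELECT id, new_col_1, new_col_2
    FROM mydb.partitioned_events
""").show()
# rows originating from older Parquet files show null for new_col_1 / new_col_2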

Note that for the CSV-based partition, it's much more brutal, because the CSV files don't have a "schema" -- they just have a list of values that are mapped to the table columns, in order. On reaching EOL all missing columns are set to Null; on reaching the value for the last column, any extra value on the line is ignored.
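
Hive's TextFile SerDe does that positional mapping itself, but a loosely analogous behaviour can be seen with Spark's own CSV reader in its default PERMISSIVE mode; the file path, contents and schema below are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("city", StringType()),
])

# Suppose /tmp/people.csv contains:
#   alice,30             <- short row: "city" is filled with NULL
#   bob,25,paris,extra   <- long row: the trailing extra value is dropped
df = spark.read.schema(schema).csv("/tmp/people.csv")
df.show()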
