Hive enforces schema during read time?


Question

    What is the difference and meaning of these two statements that I encountered during a lecture here:

    1. Traditional databases enforce schema during load time.
    

    and

    2. Hive enforces schema during read time.
    

    Answer

    You touch on one of the reasons why Hadoop and other NoSQL strategies have been so successful, so I'm not sure if you were expecting to get a dissertation or not, but here it is! The extra flexibility and agility in data analysis has probably contributed to the explosion of "data science", just because it makes large-scale data analysis easier in general.

    A traditional relational database stores the data with schema in mind. It knows that the second column is an integer, it knows that it has 40 columns, etc. Therefore, you need to specify your schema ahead of time and have it well planned out. This is "schema on write" -- that is, the schema is applied when the data is being written to the data store.
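As a rough illustration of schema on write, here is a minimal Python sketch. The `TypedTable` class and the two-column `SCHEMA` are hypothetical names invented for this example, not part of any database API; the point is only that a bad row is rejected at the moment it is written.

```python
from dataclasses import dataclass, field

# Hypothetical two-column schema: (column name, expected Python type).
SCHEMA = [("user_id", int), ("name", str)]


@dataclass
class TypedTable:
    """Toy table that checks every row against SCHEMA when it is written."""
    rows: list = field(default_factory=list)

    def insert(self, row):
        # Enforce the column count first, then each column's type.
        if len(row) != len(SCHEMA):
            raise ValueError(f"expected {len(SCHEMA)} columns, got {len(row)}")
        for value, (name, expected) in zip(row, SCHEMA):
            if not isinstance(value, expected):
                raise TypeError(f"column {name!r} expects {expected.__name__}")
        self.rows.append(tuple(row))


table = TypedTable()
table.insert((1, "alice"))        # accepted: matches the schema

try:
    table.insert(("two", "bob"))  # rejected at write time, never stored
except TypeError as err:
    rejected = str(err)
```

Everything that ends up in `table.rows` is already known to be well-typed, which is where the type-safety benefit of this approach comes from.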

    Hive (in some cases), Hadoop, and many other NoSQL systems in general are about "schema on read" -- the schema is applied as the data is being read off of the data store. Consider the following line of raw text:

    A:B:C~E:F~G:H~~I::J~K~L
    

    There are a couple of ways to interpret this. ~ could be the delimiter or maybe : could be the delimiter. Who knows? With schema on read, it doesn't matter. You decide what the schema is when you analyze the data, not when you write the data. This example is a bit ridiculous in that you probably won't ever encounter this case, but hopefully it gets the point across.
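The two readings of that line can be sketched in a few lines of Python. The helper name `read_with_delimiter` is made up for this illustration; the stored line is never changed, only the interpretation applied when reading it.

```python
# The raw line from the example above, stored exactly as written.
RAW_LINE = "A:B:C~E:F~G:H~~I::J~K~L"


def read_with_delimiter(line, delimiter):
    """Apply a 'schema' (here just a delimiter choice) at read time."""
    return line.split(delimiter)


tilde_fields = read_with_delimiter(RAW_LINE, "~")
colon_fields = read_with_delimiter(RAW_LINE, ":")

print(tilde_fields)  # ['A:B:C', 'E:F', 'G:H', '', 'I::J', 'K', 'L']
print(colon_fields)  # ['A', 'B', 'C~E', 'F~G', 'H~~I', '', 'J~K~L']
```

Both readings give seven fields, but completely different ones, and neither required rewriting the data.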

    With schema on read, you just load your data into the data store and think about how to parse and interpret later. At the core of this explanation, schema on read means write your data first, figure out what it is later. Schema on write means figure out what your data is first, then write it after.


    There is a tradeoff here. Some of these are subjective and my own opinion.

    Benefits of schema on write:

    • Better type safety and data cleansing done for the data at rest
    • Typically more efficient (in storage size and computation) since the data is already parsed

    Downsides of schema on write:

    • You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)
    • Typically you throw away the original data, which could be bad if you have a bug in your ingest process
    • It's harder to have different views of the same data

    Benefits of schema on read:

    • Flexibility in defining how your data is interpreted at read time
      • This gives you the ability to evolve your "schema" as time goes on
      • This allows you to have different versions of your "schema"
      • This allows the original source data format to change without having to consolidate to one data format
    • You get to keep your original data
    • You can load your data before you know what to do with it (so you don't drop it on the ground)
    • Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data
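To make the "evolving schema" benefit concrete, here is a small Python sketch. The sample lines and the `read_v1` / `read_v2` functions are assumptions made up for this illustration: the same raw lines are read with an early schema and then with a later one, without reloading or rewriting anything.

```python
from datetime import date

# Hypothetical raw data, loaded once and never modified.
RAW_LINES = [
    "1,alice,2014-01-05",
    "2,bob,2014-02-11",
]


def read_v1(line):
    """First version of the schema: only id and name are interpreted."""
    user_id, name, _ = line.split(",")
    return {"user_id": int(user_id), "name": name}


def read_v2(line):
    """Later version: the third field is now read as a signup date."""
    user_id, name, signup = line.split(",")
    return {
        "user_id": int(user_id),
        "name": name,
        "signup": date.fromisoformat(signup),
    }


v1_rows = [read_v1(line) for line in RAW_LINES]
v2_rows = [read_v2(line) for line in RAW_LINES]
```

Both versions coexist over the same stored lines, which is also what makes it easy to have different views of the same data.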

    Downsides of schema on read:

    • Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with formats like XML)
    • The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)
    • More error-prone, and your analytics have to account for dirty data
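The last two downsides can be seen in a short Python sketch. The sample lines and the `parse_row` / `read_all` helpers are hypothetical: because nothing was validated when the data was written, every read has to parse each line again and decide what to do with the ones that do not fit.

```python
# Hypothetical raw lines; two of them do not fit an (id, value) integer schema.
RAW_LINES = ["1,42", "2,not-a-number", "3", "4,7"]


def parse_row(line):
    """Return (id, value) as integers, or None if the line does not fit."""
    parts = line.split(",")
    if len(parts) != 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


def read_all(lines):
    """Parse every line at read time, separating good rows from dirty ones."""
    good, bad = [], 0
    for line in lines:
        row = parse_row(line)
        if row is None:
            bad += 1
        else:
            good.append(row)
    return good, bad


good_rows, bad_count = read_all(RAW_LINES)
total = sum(value for _, value in good_rows)
print(good_rows, bad_count, total)  # [(1, 42), (4, 7)] 2 49
```

Whether dirty rows are skipped, counted, or turned into nulls is a policy choice that each analysis has to make for itself; this sketch simply counts them.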
