When to use Hadoop, HBase, Hive and Pig?


Question

What are the benefits of using Hadoop, HBase, or Hive?

From my understanding, HBase avoids using map-reduce and has column-oriented storage on top of HDFS. Hive is an SQL-like interface for Hadoop and HBase.

I would also like to know how Hive compares with Pig.

Answer

MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using other HBase APIs, such as the Java client, to put or fetch data. But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so that doesn't make much sense: a normal sequential program would be highly inefficient when your data is that huge.

Coming back to the first part of your question, Hadoop is basically two things: a distributed filesystem (HDFS) + a computation or processing framework (MapReduce). Like any other filesystem, HDFS provides us with storage, but in a fault-tolerant manner, with high throughput and a lower risk of data loss (because of replication). But, being a filesystem, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.
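The random-access point can be illustrated with a toy contrast (illustrative Python only, not real HDFS/HBase APIs): reading one record from an unindexed sequential file means scanning it, while a key-indexed store answers a single-row lookup directly.

```python
# Append-only records, as you would stream them out of a file on HDFS
log = [("row1", "a"), ("row2", "b"), ("row3", "c")]

def scan_lookup(key):
    # HDFS-style access: no index, so reading one record means scanning the data
    for k, v in log:
        if k == key:
            return v
    return None

# HBase-style access: rows indexed by key, so a single row is fetched directly
kv = dict(log)

assert scan_lookup("row2") == kv["row2"] == "b"
```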

Coming to Hive: it provides us with data-warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL-like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. On top of that, you can even map your existing HBase tables to Hive and operate on them.

Pig is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them with the Pig interpreter. Pig makes our life a lot easier; writing raw MapReduce is not always easy, and in some cases it can really become a pain.
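For a flavour of Pig Latin, the classic word count is only a few lines (the file paths here are placeholders), versus pages of hand-written Java MapReduce:

```pig
-- Word count in Pig Latin; 'input.txt' and 'wordcounts' are placeholder paths
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcounts';
```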

I wrote an article some time ago with a short comparison of the different tools of the Hadoop ecosystem. It's not an in-depth comparison, but a short intro to each of these tools, which can help you get started. (Just to add to my answer; no self-promotion intended.)

Both Hive and Pig queries get converted into MapReduce jobs under the hood.

HTH

