When to use Hadoop, HBase, Hive and Pig?


Question

What are the benefits of using either Hadoop, HBase, or Hive?

From my understanding, HBase avoids using MapReduce and has column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase.

I would also like to know how Hive compares with Pig.

Solution

MapReduce is just a computing framework; HBase has nothing to do with it. That said, you can efficiently put data into, and fetch data from, HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using the other HBase APIs, such as the Java client, to put or fetch the data. But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so that often doesn't make much sense: a normal sequential program would be highly inefficient when your data is too huge.
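
For instance, here is a minimal sketch of such a sequential program using the HBase Java client API. The table name "users", the column family "info", and the values are made up for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            // Put one cell: (row key, column family, qualifier) -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Get it back: a random read, which plain HDFS does not offer
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value)); // prints "Alice"
        }
    }
}
```

A program like this is fine for a handful of rows; for bulk loads or scans over huge datasets you would reach for MapReduce jobs instead.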

Coming back to the first part of your question: Hadoop is basically two things, a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like any other file system, HDFS provides us storage, but in a fault-tolerant manner, with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture: it is a distributed, scalable, big-data store, modelled after Google's BigTable, and it stores data as key/value pairs.
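
To make the contrast concrete, here is a minimal sketch of HDFS access through Hadoop's FileSystem API; the path is hypothetical. Files are written and read as streams, and you cannot update bytes in place, which is exactly the gap HBase fills:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // the default FS (HDFS on a cluster)

        Path path = new Path("/tmp/example.txt"); // hypothetical path
        // Writing is append-style streaming; existing bytes cannot be edited in place
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }
        // Reading is streaming too (seek is supported, random update is not)
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}
```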

Coming to Hive: it provides us data-warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL-like interface, which makes your work easier in case you are coming from an SQL background. You can create tables in Hive and store data there. On top of that, you can even map your existing HBase tables to Hive and operate on them.
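
For example, a minimal sketch of creating and querying a Hive table over JDBC, assuming a HiveServer2 instance at localhost:10000 and a made-up table name:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver

        // Hypothetical HiveServer2 endpoint; empty user/password for an unsecured setup
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // Plain HiveQL, just like SQL
            stmt.execute("CREATE TABLE IF NOT EXISTS pokes (foo INT, bar STRING)");
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM pokes")) {
                while (rs.next()) {
                    System.out.println("rows: " + rs.getLong(1));
                }
            }
        }
    }
}
```

A query like that COUNT(*) is itself compiled into a MapReduce job behind the scenes, as noted at the end of this answer.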

Pig, on the other hand, is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig has two parts: the Pig interpreter and its language, PigLatin. You write Pig scripts in PigLatin and process them with the Pig interpreter. Pig makes our life a lot easier; writing MapReduce by hand is not always easy, and in some cases it can really become a pain.
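
As an illustration, here is the classic word count written in PigLatin and submitted from Java through the PigServer API; the input and output paths are made up:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Run on the cluster as MapReduce (ExecType.LOCAL would run locally)
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // The PigLatin script: load lines, split into words, group, count
        pig.registerQuery("lines = LOAD '/tmp/input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Storing the result is what actually triggers the underlying MapReduce jobs
        pig.store("counts", "/tmp/wordcount-out");
    }
}
```

The same four PigLatin statements could be saved in a .pig file and run with the pig command-line tool; compare that with the amount of Java a hand-written MapReduce word count takes.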



Some time ago I wrote an article on a short comparison of the different tools of the Hadoop ecosystem. It's not an in-depth comparison, but a short intro to each of these tools, which can help you get started. (Just to add on to my answer; no self-promotion intended.)

Both Hive and Pig queries get converted into MapReduce jobs under the hood.

HTH



