Pig vs Hive vs Native MapReduce


Question

I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of which scenarios call for Hive, Pig, or native MapReduce.

I went through a few articles which basically point out that Hive is for structured processing and Pig is for unstructured processing. When do we need native MapReduce? Can you point out a few scenarios that can't be solved using Pig or Hive but can be solved in native MapReduce?

Answer

Complex branching logic with many nested if .. else .. structures is easier and quicker to implement in standard MapReduce. For processing structured data you could use Pangool, which also simplifies things like JOIN. Standard MapReduce also gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance. However, it takes more time to code and to introduce changes.
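
To make the branching point concrete, here is a minimal in-memory sketch in plain Python. It does not use the Hadoop API; it only mimics the map, group, and reduce steps of a single job. The record fields (`type`, `status`, `amount`), the output keys, and the function names are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit (key, value) pairs using nested branching.

    The field names and thresholds are illustrative assumptions only.
    """
    if record["type"] == "order":
        if record["status"] == "paid":
            if record["amount"] >= 100:
                yield ("large_paid_orders", record["amount"])
            else:
                yield ("small_paid_orders", record["amount"])
        elif record["status"] == "refunded":
            yield ("refunds", -record["amount"])
        # orders in any other status are dropped
    elif record["type"] == "click":
        yield ("clicks", 1)

def reduce_values(key, values):
    """Reduce step: sum all values emitted for one key."""
    return key, sum(values)

def run_job(records):
    """Simulate one job: map every record, group by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return dict(reduce_values(key, values) for key, values in grouped.items())

records = [
    {"type": "order", "status": "paid", "amount": 150},
    {"type": "order", "status": "paid", "amount": 40},
    {"type": "order", "status": "refunded", "amount": 40},
    {"type": "order", "status": "pending", "amount": 10},
    {"type": "click"},
    {"type": "click"},
]
result = run_job(records)
```

All four aggregates come out of one pass over the input. That is the kind of control over the number of jobs that the paragraph above refers to.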

Apache Pig is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key). This makes it simpler to implement things like:

  1. Get the top N elements for each group;
  2. Calculate the total for each group and then put that total against each row in the group;
  3. Use Bloom filters for JOIN optimisations;
  4. Multiquery support (this is when Pig tries to minimise the number of MapReduce jobs by doing more work in a single job).
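
The first two items in the list can be sketched in plain Python to show what "working with a bag" means. This is not Pig Latin and not how Pig executes anything. It only mirrors the logic of grouping rows on a key and then operating on each group as a whole. The function names and sample rows are hypothetical.

```python
from collections import defaultdict
import heapq

def group_by_key(rows):
    """Collect the values for each key into one list (a stand-in for a bag)."""
    bags = defaultdict(list)
    for key, value in rows:
        bags[key].append(value)
    return bags

def top_n_per_group(rows, n):
    """Item 1: keep the n largest values within each group."""
    return {key: heapq.nlargest(n, bag) for key, bag in group_by_key(rows).items()}

def share_of_group_total(rows):
    """Item 2: compute each group's total, then compare every row against it."""
    totals = {key: sum(bag) for key, bag in group_by_key(rows).items()}
    return [(key, value, value / totals[key]) for key, value in rows]

rows = [("a", 10), ("a", 30), ("a", 60), ("b", 5), ("b", 15)]
top2 = top_n_per_group(rows, 2)
shares = share_of_group_total(rows)
```

Here `top2` maps each key to its two largest values, and `shares` pairs each original row with its fraction of the group total.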

Hive is better suited for ad-hoc queries, but its main advantage is that it has an engine that stores and partitions data. Its tables can still be read from Pig or standard MapReduce.
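
As a rough illustration of the partitioning idea, Hive stores each partition of a table in a subdirectory named `column=value` under the table's directory. The plain-Python sketch below only mimics that directory layout in memory and does not call Hive. The table path `/warehouse/sales`, the column names, and the helper functions are hypothetical.

```python
from collections import defaultdict

def partition_path(table_dir, partition_columns, row):
    """Build a Hive-style partition path such as table/country=US/dt=2024-01-01."""
    parts = [f"{col}={row[col]}" for col in partition_columns]
    return "/".join([table_dir] + parts)

def partition_rows(table_dir, partition_columns, rows):
    """Group rows by partition path; partition columns live in the path, not the data."""
    partitions = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k not in partition_columns}
        partitions[partition_path(table_dir, partition_columns, row)].append(data)
    return dict(partitions)

rows = [
    {"country": "US", "dt": "2024-01-01", "amount": 10},
    {"country": "US", "dt": "2024-01-01", "amount": 20},
    {"country": "DE", "dt": "2024-01-02", "amount": 5},
]
layout = partition_rows("/warehouse/sales", ["country", "dt"], rows)
```

Because the partition values are encoded in the path, a query that filters on `country` or `dt` only needs to read the matching directories, and other tools can read the same files directly.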

One more thing: Hive and Pig are not well suited to working with hierarchical data.
