Hadoop: what should be mapped and what should be reduced?

Views: 77  Posted: 2018/5/31 20:20:04  Tags: java hadoop mapreduce


Question

This is my first time using map/reduce. I want to write a program that processes a large log file. For example, if I was processing a log file that had records consisting of {Student, College, and GPA}, and wanted to sort all students by college, what would be the 'map' part and what would be the 'reduce' part? I am having some difficulty with the concept, despite having gone over a number of tutorials and examples.

Thanks!

Solution

Technically speaking, Hadoop MapReduce treats everything as key-value pairs; you just need to define what the keys are and what the values are.  The signatures of map and reduce are
map: (K1 x V1) -> (K2 x V2) list
reduce: (K2 x V2) list -> (K3 x V3) list
with sorting taking place on K2 values in the intermediate shuffle phase between map and reduce.
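
To make the two signatures concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop at all. `run_mapreduce`, `college_count_mapper`, and `sum_reducer` are hypothetical names invented for this illustration, and the sample records are made up. One assumption worth stating: the reducer here receives a key together with the list of its values, which is how the "(K2 x V2) list" input is usually delivered in practice.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(inputs, mapper, reducer):
    """Local imitation of map -> shuffle/sort -> reduce (not Hadoop itself)."""
    # Map: every (k1, v1) input pair produces a list of (k2, v2) pairs.
    intermediate = [pair for k1, v1 in inputs for pair in mapper(k1, v1)]
    # Shuffle: sort on k2 so that equal keys end up next to each other.
    intermediate.sort(key=itemgetter(0))
    output = []
    # Reduce: one call per distinct k2, with all v2 values for that key.
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(k2, [v2 for _, v2 in group]))
    return output


def college_count_mapper(student, college):
    # (K1, V1) = (student, college) -> [(K2, V2)] = [(college, 1)]
    return [(college, 1)]


def sum_reducer(college, ones):
    # (K2, [V2]) -> [(K3, V3)] = [(college, number of students)]
    return [(college, sum(ones))]


# Made-up sample input.
inputs = [("alice", "MIT"), ("bob", "CMU"), ("carol", "MIT")]
result = run_mapreduce(inputs, college_count_mapper, sum_reducer)
print(result)  # [('CMU', 1), ('MIT', 2)]
```

Note that the output is ordered by K2 even though neither function sorts anything; the ordering comes entirely from the shuffle step.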

If your inputs are of the form
Student x (College x GPA)
then your mapper should do nothing more than move the College value into the key:
map: (s, (c, g)) -> [(c, (s, g))]
With college as the new key, Hadoop will sort by college for you. Your reducer is then just a plain old "identity reducer."
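
The mapper and identity reducer described above can be sketched locally like this. It is plain Python standing in for the framework, not Hadoop code: `college_mapper` and `identity_reducer` are hypothetical names, the records are invented, and the explicit `sort` call plays the role of the shuffle phase.

```python
from itertools import groupby


def college_mapper(student, college_gpa):
    # Move the college into the key; keep student and GPA as the value.
    college, gpa = college_gpa
    return [(college, (student, gpa))]


def identity_reducer(college, values):
    # Emit every value unchanged under the same key.
    return [(college, value) for value in values]


# Made-up input of the form Student x (College x GPA).
records = [
    ("alice", ("MIT", 3.9)),
    ("bob", ("CMU", 3.5)),
    ("carol", ("MIT", 3.2)),
]

mapped = [pair for student, cg in records for pair in college_mapper(student, cg)]
mapped.sort(key=lambda kv: kv[0])  # stands in for Hadoop's sort on K2

sorted_by_college = []
for college, group in groupby(mapped, key=lambda kv: kv[0]):
    sorted_by_college.extend(identity_reducer(college, [v for _, v in group]))

for college, (student, gpa) in sorted_by_college:
    print(college, student, gpa)
```

In this local sketch the students within one college keep their input order because Python's sort is stable. On a real cluster the order of values within a key should not be relied on unless you arrange for it explicitly.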

If you are carrying out a sorting operation in practice (that is, this isn't a homework problem), then check out Hive or Pig. These systems drastically simplify these kinds of tasks, and sorting on a particular column becomes quite trivial. However, it is always educational to write, say, a Hadoop Streaming job for a task like the one you describe here, to give you a better understanding of mappers and reducers.
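
As a rough idea of what such a streaming mapper might look like, here is a sketch in Python. It assumes each log line is comma-separated as `student,college,gpa`, which the question does not actually specify, and `map_line` and `run` are hypothetical names. The one Hadoop Streaming convention it relies on is that, by default, the text before the first tab on each output line is treated as the key.

```python
import io


def map_line(line):
    """Turn 'student,college,gpa' into 'college<TAB>student,gpa'.

    Returns None for lines that do not have exactly three fields.
    """
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3:
        return None
    student, college, gpa = fields
    return f"{college}\t{student},{gpa}"


def run(instream, outstream):
    # Read records line by line and write key<TAB>value lines.
    for line in instream:
        out = map_line(line)
        if out is not None:
            outstream.write(out + "\n")


# Local demo with made-up data. In an actual streaming job you would
# import sys and call run(sys.stdin, sys.stdout) instead.
demo_out = io.StringIO()
run(io.StringIO("alice,MIT,3.9\nbob,CMU,3.5\nbad line\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Since the reducer for this task is the identity, any program that copies its input to its output unchanged would do on the reduce side.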
