Hadoop: what should be mapped and what should be reduced?
Question
This is my first time using map/reduce. I want to write a program that processes a large log file. For example, if I was processing a log file that had records consisting of {Student, College, and GPA}, and wanted to sort all students by college, what would be the 'map' part and what would be the 'reduce' part? I am having some difficulty with the concept, despite having gone over a number of tutorials and examples.
Thanks!
Solution
Technically speaking, Hadoop MapReduce treats everything as key-value pairs; you just need to define what the keys are and what the values are. The signatures of map and reduce are
map: (K1 x V1) -> (K2 x V2) list
reduce: (K2 x V2) list -> (K3 x V3) list
with sorting taking place on K2 values in the intermediate shuffle phase between map and reduce.
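The three phases can be sketched in a few lines of plain Python. This is only an in-memory illustration of the signatures above, not Hadoop's actual API: `run_mapreduce`, `word_mapper`, and `sum_reducer` are hypothetical names, the word-count data is made up, and a single global sort stands in for the shuffle (roughly what you would see with one reducer).

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle/sort -> reduce in memory (illustration only)."""
    # Map phase: each (k1, v1) input pair yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # Shuffle phase: sort by K2 so that pairs with equal keys become adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce phase: the reducer is called once per key with all of its values.
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        values = [v2 for _, v2 in group]
        output.extend(reducer(k2, values))
    return output


def word_mapper(line_number, line):
    # (K1 x V1) -> (K2 x V2) list: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]


def sum_reducer(word, counts):
    # (K2 x V2) list -> (K3 x V3) list: one (word, total) pair per key.
    return [(word, sum(counts))]


if __name__ == "__main__":
    lines = [(0, "b a"), (1, "a c")]
    print(run_mapreduce(lines, word_mapper, sum_reducer))
```

Note that the output comes back ordered by key even though neither function sorts anything; the ordering is a side effect of the shuffle step.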
If your inputs are of the form
Student x (College x GPA)
then your mapper should do nothing more than move the College value into the key position:
map: (s, (c, g)) -> [(c, (s, g))]
With college as the new key, Hadoop will sort by college for you. Your reducer, then, is just a plain old "identity reducer."
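As a rough sketch of how this could look in the style of a Hadoop Streaming job, the functions below work on lines of text, where everything before the first tab is treated as the key. The comma-separated `student,college,gpa` input layout, the function names, and the sample records are all assumptions made for illustration; `simulate_shuffle` only imitates locally the sort that Hadoop would perform between the two steps.

```python
def map_lines(lines):
    """Mapper: turn 'student,college,gpa' into 'college<TAB>student,gpa'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        student, college, gpa = [field.strip() for field in line.split(",")]
        # The text before the first tab is the key the framework sorts on.
        yield f"{college}\t{student},{gpa}"


def simulate_shuffle(lines):
    """Local stand-in for the shuffle: sort mapper output by key only."""
    return sorted(lines, key=lambda line: line.split("\t", 1)[0])


def reduce_lines(lines):
    """Identity reducer: pass every already-sorted line through unchanged."""
    for line in lines:
        yield line.rstrip("\n")


if __name__ == "__main__":
    raw = ["Alice,MIT,3.9\n", "Bob,CMU,3.5\n", "Carol,MIT,3.7\n"]
    for out in reduce_lines(simulate_shuffle(map_lines(raw))):
        print(out)
```

In a real streaming job the mapper and reducer would be separate scripts that read from standard input and print to standard output; that wiring is left out here so the logic can be run and checked locally.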
If you are carrying out a sorting operation in practice (that is, this isn't a homework problem), then check out Hive or Pig. These systems drastically simplify these kinds of tasks, and sorting on a particular column becomes quite trivial. However, it is always educational to write, say, a Hadoop Streaming job for tasks like the one you identified here, to give you a better understanding of mappers and reducers.