Hadoop: what should be mapped and what should be reduced?


Question



This is my first time using map/reduce. I want to write a program that processes a large log file. For example, if I was processing a log file that had records consisting of {Student, College, and GPA}, and wanted to sort all students by college, what would be the 'map' part and what would be the 'reduce' part? I am having some difficulty with the concept, despite having gone over a number of tutorials and examples.

Thanks!

Answer

Technically speaking, Hadoop MapReduce treats everything as key-value pairs; you just need to define what the keys are and what the values are. The signatures of map and reduce are

map: (K1 x V1) -> (K2 x V2) list
reduce: (K2 x V2) list -> (K3 x V3) list

with sorting taking place on K2 values in the intermediate shuffle phase between map and reduce.
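To make those signatures concrete, here is a minimal in-memory sketch of the three phases. It is not Hadoop code: `run_mapreduce`, `length_mapper`, and `identity_reducer` are hypothetical names invented for illustration, and the sample records are made up.

```python
from itertools import groupby
from operator import itemgetter


def run_mapreduce(records, mapper, reducer):
    """Simulate map -> shuffle/sort -> reduce in memory (illustration only)."""
    # Map: each (K1, V1) input pair yields a list of (K2, V2) pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(mapper(k1, v1))

    # Shuffle: sort the intermediate pairs by K2 so equal keys are adjacent.
    intermediate.sort(key=itemgetter(0))

    # Reduce: each key with its grouped values yields a list of (K3, V3) pairs.
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        values = [v2 for _, v2 in group]
        output.extend(reducer(k2, values))
    return output


def length_mapper(k1, v1):
    # Hypothetical mapper: ignore the input key and key each word by its length.
    return [(len(v1), v1)]


def identity_reducer(k2, values):
    # Emit every value unchanged under its key.
    return [(k2, v) for v in values]


result = run_mapreduce(
    [(0, "pear"), (1, "fig"), (2, "kiwi")], length_mapper, identity_reducer
)
print(result)
```

The sort here happens to be stable, so values keep their input order within a key. Hadoop itself makes no such promise about the order of values for one key unless you set up a secondary sort.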

If your inputs are of the form

Student x (College x GPA)

then your mapper should do nothing more than move the College value into the key position:

map: (s, (c, g)) -> [(c, (s, g))]

With college as the new key, Hadoop will sort by college for you. Your reducer, then, is just a plain old "identity reducer."
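As a rough sketch of that mapper and identity reducer in the line-oriented style a streaming job would use: the code below assumes each record is one tab-separated line of the form student, college, GPA, which the question does not specify. The function names and sample rows are hypothetical, and the `simulate` helper only imitates the shuffle locally.

```python
def map_line(line):
    """Turn 'student<TAB>college<TAB>gpa' into 'college<TAB>student<TAB>gpa'."""
    student, college, gpa = line.rstrip("\n").split("\t")
    # The text before the first tab becomes the key, so college goes first.
    return f"{college}\t{student}\t{gpa}"


def reduce_line(line):
    """Identity reducer: pass each already-sorted line through unchanged."""
    return line.rstrip("\n")


def simulate(lines):
    """Local stand-in for map -> sort by key -> reduce (not a real Hadoop run)."""
    mapped = [map_line(line) for line in lines]
    # Sort on the key only (the text before the first tab), as the shuffle would.
    shuffled = sorted(mapped, key=lambda row: row.split("\t", 1)[0])
    return [reduce_line(row) for row in shuffled]


# Made-up sample records, for illustration only.
sample = ["alice\tMIT\t3.9", "bob\tCMU\t3.5", "carol\tMIT\t3.7"]
for row in simulate(sample):
    print(row)
```

In an actual streaming job, each function would sit in its own script that loops over `sys.stdin` and prints to standard output. Note that with more than one reducer, every output file is sorted by college on its own, and all students of a given college land in the same file, but the files are not globally ordered relative to each other unless you use a single reducer or a total-order partitioner.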

If you are carrying out a sorting operation in practice (that is, this isn't a homework problem), then check out Hive or Pig. These systems drastically simplify these kinds of tasks, and sorting on a particular column becomes quite trivial. However, it is always educational to write, say, a Hadoop Streaming job for tasks like the one you identified here, to give you a better understanding of mappers and reducers.
