Need help in writing Map/Reduce job to find average


Problem description

I'm fairly new to Hadoop Map/Reduce. I'm trying to write a Map/Reduce job to find the average time taken by n processes, given an input text file as below:

ProcessName Time
process1    10
process2    20
processn    30

I went through a few tutorials but I'm still not able to get a thorough understanding. What should my mapper and reducer classes do for this problem? Will my output always be a text file, or is it possible to directly store the average in some sort of a variable?

Thanks.

Solution

Your mappers read the text file and apply the following map function to every line:

map: (key, value)
  time = value[2]
  emit("1", time)

All map calls emit the key "1", which will be processed by a single reduce function:

reduce: (key, value)
  result = sum(value) / n
  emit("1", result)

Since you're using Hadoop, you have probably already seen StringTokenizer used in the map function; you can use it to extract only the time from each line. You also need some way to compute n (the number of processes), for example a Counter in another job that just counts lines.
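
To round out the sketch, here is a hypothetical driver that wires the two classes above together. Forcing a single reduce task is what guarantees that one reduce call sees all emitted times; the resulting average is written as text into the job's output directory (e.g. part-r-00000).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: runs the job with exactly one reducer so that a single
// reduce call receives every emitted time and computes the overall average.
public class AverageTimeDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average process time");
        job.setJarByClass(AverageTimeDriver.class);

        job.setMapperClass(AverageTimeMapper.class);
        job.setReducerClass(AverageTimeReducer.class);
        job.setNumReduceTasks(1); // a single reducer sees all times

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}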
