Consolidate MapReduce logs


Problem Description



Debugging Hadoop map-reduce jobs is a pain. I can print out to stdout, but these logs show up on all of the different machines on which the MR job was run. I can go to the jobtracker, find my job, and click on each individual mapper to get to its task log, but this is extremely cumbersome when you have 20+ mappers/reducers.

I was thinking that I might have to write a script that would scrape through the job tracker to figure out what machine each of the mappers/reducers ran on and then scp the logs back to one central location where they could be cat'ed together. Before I waste my time doing this, does someone know of a better way to get one, consolidated stdout log for a job's mappers and reducers?

Solution

So I ended up just creating a Python script to do this. It wasn't horrible. Here's the script in case anyone else wants to use it. Obviously it needs more error checking, shouldn't have hard-coded URLs, etc., but you get the idea. Note that you need to download Beautiful Soup.

#!/usr/bin/python
import sys
from bs4 import BeautifulSoup as BS
from urllib2 import urlopen
import re

TRACKER_BASE_URL = 'http://my.tracker.com:50030/'
trackerURLformat = TRACKER_BASE_URL + 'jobtasks.jsp?jobid=%s&type=%s&pagenum=1' # use map or reduce for the type

# Scrape the job's task list page, follow each task link to its log page,
# and return the concatenated stdout text from every task
def findLogs(url):
    finalLog = ""

    print "Looking for Job: " + url
    html = urlopen(url).read()
    trackerSoup = BS(html)
    taskURLs = [h.get('href') for h in trackerSoup.find_all(href=re.compile('taskdetails'))]

    # Now that we know where all the tasks are, go find their logs
    logURLs = []
    for taskURL in taskURLs:
        taskHTML = urlopen(TRACKER_BASE_URL + taskURL).read()
        taskSoup = BS(taskHTML)
        allLogURL = taskSoup.find(href=re.compile('all=true')).get('href')
        logURLs.append(allLogURL)

    # Now fetch the stdout log from each
    for logURL in logURLs:
        logHTML = urlopen(logURL).read()
        logSoup = BS(logHTML)
        stdoutText = logSoup.body.pre.text.lstrip()
        finalLog += stdoutText

    return finalLog


# argv[1] is the Hadoop job id as shown by the jobtracker
def main(argv):
    with open(argv[1] + "-map-stdout.log", "w") as f:
        f.write(findLogs(trackerURLformat % (argv[1], "map")))
        print "Wrote mapers stdouts to " + f.name

    with open(argv[1] + "-reduce-stdout.log", "w") as f:
        f.write(findLogs(trackerURLformat % (argv[1], "reduce")))
        print "Wrote reducer stdouts to " + f.name

if __name__ == "__main__":
    main(sys.argv)
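For reference, an invocation looks like the following. The script name and job ID are just placeholders (save the script under whatever name you like, and pass the job ID the jobtracker shows for your job):

    python consolidate_logs.py job_201301010000_0001

That writes job_201301010000_0001-map-stdout.log and job_201301010000_0001-reduce-stdout.log to the current directory. Beautiful Soup 4 itself can be installed with pip install beautifulsoup4.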

