Consolidate MapReduce logs


Problem Description



Debugging Hadoop map-reduce jobs is a pain. I can print out to stdout, but these logs show up on all of the different machines on which the MR job was run. I can go to the jobtracker, find my job, and click on each individual mapper to get to its task log, but this is extremely cumbersome when you have 20+ mappers/reducers.

I was thinking that I might have to write a script that would scrape through the job tracker to figure out what machine each of the mappers/reducers ran on and then scp the logs back to one central location where they could be cat'ed together. Before I waste my time doing this, does someone know of a better way to get one, consolidated stdout log for a job's mappers and reducers?

Solution

So I ended up just creating a Python script to do this. It wasn't horrible. Here's the script in case anyone else wants to use it. Obviously it needs more error checking, shouldn't have hard-coded URLs, etc., but you get the idea. Note that you need to download Beautiful Soup.

#!/usr/bin/python
import sys
from bs4 import BeautifulSoup as BS
from urllib2 import urlopen
import re

TRACKER_BASE_URL = 'http://my.tracker.com:50030/'
trackerURLformat = TRACKER_BASE_URL + 'jobtasks.jsp?jobid=%s&type=%s&pagenum=1' # use map or reduce for the type

# Scrape the job's task list page, follow each task link to its log page,
# and return the concatenated stdout text from every task
def findLogs(url):
    finalLog = ""

    print "Looking for Job: " + url
    html = urlopen(url).read()
    trackerSoup = BS(html)
    taskURLs = [h.get('href') for h in trackerSoup.find_all(href=re.compile('taskdetails'))]

    # Now that we know where all the tasks are, go find their logs
    logURLs = []
    for taskURL in taskURLs:
        taskHTML = urlopen(TRACKER_BASE_URL + taskURL).read()
        taskSoup = BS(taskHTML)
        allLogURL = taskSoup.find(href=re.compile('all=true')).get('href')
        logURLs.append(allLogURL)

    # Now fetch the stdout log from each
    for logURL in logURLs:
        logHTML = urlopen(logURL).read()
        logSoup = BS(logHTML)
        stdoutText = logSoup.body.pre.text.lstrip()
        finalLog += stdoutText

    return finalLog


# argv[1] is the Hadoop job id as shown by the jobtracker
def main(argv):
    with open(argv[1] + "-map-stdout.log", "w") as f:
        f.write(findLogs(trackerURLformat % (argv[1], "map")))
        print "Wrote mapers stdouts to " + f.name

    with open(argv[1] + "-reduce-stdout.log", "w") as f:
        f.write(findLogs(trackerURLformat % (argv[1], "reduce")))
        print "Wrote reducer stdouts to " + f.name

if __name__ == "__main__":
    main(sys.argv)
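For reference, an invocation looks like the following. The script name and job ID are just placeholders (save the script under whatever name you like, and pass the job ID the jobtracker shows for your job):

    python consolidate_logs.py job_201301010000_0001

That writes job_201301010000_0001-map-stdout.log and job_201301010000_0001-reduce-stdout.log to the current directory. Beautiful Soup 4 itself can be installed with pip install beautifulsoup4.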

