Exporting jobs listed in Oozie Web Console


Problem description


Apologies if this question sounds basic; I'm totally new to the Hadoop environment.

What am I looking for?

In my case, there are jobs scheduled to run every day, and I want to export the list of failed jobs to an Excel sheet each day.

How do I view the workflow jobs?

Currently I use the Oozie web console to view the jobs, and I don't see an option to export. I was also unable to find this information in the Oozie documentation.

However, I found that jobs can be listed using commands like

$ oozie jobs -oozie http://localhost:8080/oozie -localtime -len 2 -filter status=RUNNING

Where am I stuck?

I want to filter the failed jobs for a given date and export them as CSV/Excel data.

Solution

@YoungHobbit was right to point at that post, which is very similar to this one; his answer was dead on target when it comes to extracting the entire list of jobs that have run on a specific day with the Oozie CLI (command-line interface).
Just don't forget to specify an "unbounded" reply, e.g. -len 999999999, to avoid side effects (the default is to show only the first 100 matches, which may be way too low if you run a lot of frequent jobs).

The trick is that you can make a more complex filter such as
  "startCreatedTime=2016-06-28T00:00Z;endcreatedtime=2016-06-28T10:00Z;status=FAILED"
... but you cannot request jobs that have FAILED or have been KILLED or have been SUSPENDED (which may result from a temporary YARN or HDFS outage) or are still suspiciously RUNNING (because a sub-workflow is SUSPENDED for instance).
So your best choice is to get the whole list, then filter out all jobs that have SUCCEEDED, with a plain old grep -- as suggested in another answer.
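
As a rough illustration of that "get the whole list for one day" step, here is a minimal Python sketch that builds the time-window filter and the matching CLI command. The function names are hypothetical, the filter keys are copied as spelled in the example above, and the choice of a full 24-hour window is an assumption; check the exact option syntax against your Oozie version.

```python
import shlex
from datetime import date, datetime, time, timedelta
from typing import List

# Same timestamp shape as the 2016-06-28T00:00Z example above.
OOZIE_TIME_FORMAT = "%Y-%m-%dT%H:%MZ"


def day_filter(day: date) -> str:
    """Build a created-time filter covering one whole day, with no status clause."""
    start = datetime.combine(day, time.min)
    end = start + timedelta(days=1)
    return (
        f"startCreatedTime={start.strftime(OOZIE_TIME_FORMAT)};"
        f"endcreatedtime={end.strftime(OOZIE_TIME_FORMAT)}"
    )


def list_jobs_command(oozie_url: str, day: date) -> List[str]:
    """Assemble the CLI arguments; -len is set high so the list is not cut at 100."""
    return [
        "oozie", "jobs",
        "-oozie", oozie_url,
        "-len", "999999999",
        "-filter", day_filter(day),
    ]


command = list_jobs_command("http://localhost:8080/oozie", date(2016, 6, 28))
# shlex.join quotes the filter, since ';' would otherwise end the shell command.
print(shlex.join(command))
```

The printed command can be pasted into a shell and piped through grep -v SUCCEEDED, which is the filtering step described above.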

Then you will also need a complex sed or awk script to break down the ugly CLI output into a well-formed CSV. Ouch!


Now, you have an alternative to the Oozie CLI: the Oozie REST API (old Cloudera tutorial here, reference for Oozie V4.2 here) lets you query the Oozie server with any programming language that provides...

  • an HTTP client
  • and a way to parse JSON messages (using plain old regular expressions, if nothing else is available)

The logic would be basically the same -- fetch the list of all jobs in the desired time window, ignore SUCCEEDED jobs, parse the others to generate a CSV record, dump into a CSV file.
But your program would be more robust, since it would be based on structured JSON input.
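
That logic might look like the following Python sketch, using only the standard library. It is an illustration under assumptions, not a tested client: the /v1/jobs path, the jobtype, filter and len parameters, the "workflows" key and the field names are my reading of the Oozie web-services reference and should be verified against your server; the sample reply is invented purely to show the data flow.

```python
import csv
import io
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Fields copied into the report; names are assumed from the Oozie REST reference.
CSV_COLUMNS = ["id", "appName", "status", "user", "createdTime", "endTime"]


def jobs_url(base_url, job_filter, length=999999999):
    """Build the job-list URL; urlencode escapes the ';' and '=' in the filter."""
    query = urlencode({"jobtype": "wf", "filter": job_filter, "len": length})
    return f"{base_url}/v1/jobs?{query}"


def fetch_jobs(url):
    """Download and decode the workflow list (needs a reachable Oozie server)."""
    with urlopen(url) as response:
        return json.load(response)["workflows"]


def not_succeeded(jobs):
    """Keep everything except SUCCEEDED: FAILED, KILLED, SUSPENDED, RUNNING..."""
    return [job for job in jobs if job.get("status") != "SUCCEEDED"]


def to_csv(jobs):
    """Render the selected fields as CSV text, ignoring any other JSON keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(jobs)
    return buffer.getvalue()


# Hypothetical reply, standing in for fetch_jobs(jobs_url(...)).
sample_reply = json.loads("""
{"total": 3, "workflows": [
  {"id": "job-1", "appName": "daily-load", "status": "SUCCEEDED", "user": "etl",
   "createdTime": "Tue, 28 Jun 2016 01:00:00 GMT", "endTime": "Tue, 28 Jun 2016 01:20:00 GMT"},
  {"id": "job-2", "appName": "daily-load", "status": "KILLED", "user": "etl",
   "createdTime": "Tue, 28 Jun 2016 02:00:00 GMT", "endTime": "Tue, 28 Jun 2016 02:05:00 GMT"},
  {"id": "job-3", "appName": "hourly-agg", "status": "RUNNING", "user": "etl",
   "createdTime": "Tue, 28 Jun 2016 03:00:00 GMT", "endTime": null}
]}
""")

suspicious = not_succeeded(sample_reply["workflows"])
report = to_csv(suspicious)
print(report)
```

Against a live server, the sample reply would be replaced by fetch_jobs(jobs_url("http://localhost:8080/oozie", "startCreatedTime=2016-06-28T00:00Z;endcreatedtime=2016-06-28T10:00Z")), and the report text written to a .csv file that Excel can open.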

One more thing: if you are familiar with Microsoft VBA, you can even use an Excel macro to build the report dynamically, in a self-service way. No need to bother with an intermediate CSV file.
