Exporting jobs listed in Oozie Web Console
Apologies if this question sounds basic; I'm totally new to the Hadoop environment.
What am I looking for?
In my case, there are jobs scheduled to run every day, and I want to export the list of failed jobs to an Excel sheet each day.
How do I view the workflow jobs?
Currently I use the Oozie web console to view the jobs, and I don't see an option to export them. I was also unable to find this information in the Oozie documentation.
However, I found that jobs can be listed using commands like
$ oozie jobs -oozie http://localhost:8080/oozie -localtime -len 2 -filter status=RUNNING
Where am I stuck?
I want to filter the failed jobs for a given date and export them as CSV/Excel data.
@YoungHobbit was right to point at that post which is very similar to this one; his answer was dead on target when it comes to extracting the entire list of jobs that have run on a specific day with the Oozie CLI (command-line interface).
Just don't forget to specify an "unbounded" reply, e.g. -len 999999999, to avoid side effects (the default is to show only the first 100 matches, which may be way too low if you run a lot of frequent jobs).
The trick is that you can make a more complex filter such as
"startCreatedTime=2016-06-28T00:00Z;endcreatedtime=2016-06-28T10:00Z;status=FAILED"
... but you cannot request jobs that have FAILED or have been KILLED or have been SUSPENDED (which may result from a temporary YARN or HDFS outage) or are still suspiciously RUNNING (because a sub-workflow is SUSPENDED for instance).
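To avoid retyping the date window every morning, the command line can be assembled by a small script. This is only a sketch: build_jobs_command is a made-up helper name, the filter names are written in lowercase (startcreatedtime, endcreatedtime) on the assumption that this is how your Oozie version expects them, and the window is assumed to be one full UTC day.

```python
from datetime import date, datetime, time, timedelta


def build_jobs_command(oozie_url, day, length=999999999, status=None):
    """Return the argument list for `oozie jobs` covering one calendar day (UTC)."""
    start = datetime.combine(day, time.min)
    end = start + timedelta(days=1)
    stamp = "%Y-%m-%dT%H:%MZ"  # same timestamp layout as the filter shown above
    parts = [
        # Assumption: lowercase filter names; adjust if your server differs.
        "startcreatedtime=" + start.strftime(stamp),
        "endcreatedtime=" + end.strftime(stamp),
    ]
    if status is not None:
        parts.append("status=" + status)
    return [
        "oozie", "jobs",
        "-oozie", oozie_url,
        "-len", str(length),  # effectively unbounded, instead of the default page
        "-filter", ";".join(parts),
    ]


cmd = build_jobs_command("http://localhost:8080/oozie", date(2016, 6, 28))
print(" ".join(cmd))
```

The list can be handed to subprocess.run as-is; leaving status at None fetches the whole day, so the non-SUCCEEDED jobs can be picked out afterwards.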
So your best choice is to get the whole list, then filter out all jobs that have SUCCEEDED with a plain old grep, as suggested in another answer.
Then you will also need a complex sed or awk script to break down the ugly CLI output into a well-formed CSV. Ouch!
Now, you have an alternative to the Oozie CLI: the Oozie REST API (old Cloudera tutorial here, reference for Oozie V4.2 here) lets you query the Oozie server with any programming language that provides...
- an HTTP client
- and a way to parse JSON messages (using plain old regular expressions, if nothing else is available)
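For instance, Python's standard library covers both requirements. The sketch below assumes the job list is served under /v1/jobs and accepts the jobtype, filter and len query parameters; jobs_url and fetch_jobs are illustrative names, and the lowercase filter names are an assumption to check against the REST reference for your version.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def jobs_url(oozie_url, start, end, length=999999999):
    """Build the REST URL listing workflow jobs created between start and end."""
    query = {
        "jobtype": "wf",  # workflow jobs only
        "filter": "startcreatedtime={};endcreatedtime={}".format(start, end),
        "len": length,    # effectively unbounded reply
    }
    # urlencode percent-escapes the '=', ':' and ';' inside the filter value.
    return oozie_url.rstrip("/") + "/v1/jobs?" + urlencode(query)


def fetch_jobs(url):
    """GET the URL and decode the JSON body (needs a reachable Oozie server)."""
    with urlopen(url) as response:
        return json.load(response)


url = jobs_url("http://localhost:8080/oozie", "2016-06-28T00:00Z", "2016-06-29T00:00Z")
print(url)
```

Only the URL is built when this runs; fetch_jobs(url) is the part that actually talks to the server.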
The logic would be basically the same -- fetch the list of all jobs in the desired time window, ignore SUCCEEDED jobs, parse the others to generate a CSV record, dump into a CSV file.
But your program would be more robust, since it would be based on structured JSON input.
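A rough sketch of that last step, assuming the decoded reply is a dict whose "workflows" entry is a list of job objects carrying id, appName, status, user, startTime and endTime fields. Those field names are my assumption, so compare them with what your server really returns; the sample payload is entirely made up.

```python
import csv
import io

# Assumed field names; check them against the JSON your Oozie version returns.
COLUMNS = ["id", "appName", "status", "user", "startTime", "endTime"]


def unsuccessful_jobs_csv(payload):
    """Return CSV text for every listed workflow whose status is not SUCCEEDED."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for job in payload.get("workflows", []):
        if job.get("status") != "SUCCEEDED":
            # Missing or null fields become empty cells.
            writer.writerow({name: job.get(name) or "" for name in COLUMNS})
    return buffer.getvalue()


# Hypothetical payload, shaped like a decoded job-list response.
sample = {"workflows": [
    {"id": "job-1", "appName": "daily-load", "status": "SUCCEEDED", "user": "etl"},
    {"id": "job-2", "appName": "daily-report", "status": "KILLED", "user": "etl"},
]}
print(unsuccessful_jobs_csv(sample))
```

Writing the returned text to a .csv file gives something Excel opens directly, and FAILED, KILLED, SUSPENDED and lingering RUNNING jobs all end up in the same report.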
One more thing: if you are familiar with Microsoft VBA, you can even use an Excel macro to build the report dynamically, in a self-service way. No need to bother with an intermediate CSV file.