Oozie shell script action


Problem description



I am exploring the capabilities of Oozie for managing Hadoop workflows. I am trying to set up a shell action which invokes some hive commands. My shell script hive.sh looks like:

#!/bin/bash
hive -f hivescript

Where the hive script (which has been tested independently) creates some tables and so on. My question is where to keep the hivescript and then how to reference it from the shell script.

I've tried two ways, first using a local path, like hive -f /local/path/to/file, and using a relative path like above, hive -f hivescript, in which case I keep my hivescript in the oozie app path directory (same as hive.sh and workflow.xml) and set it to go to the distributed cache via the workflow.xml.

With both methods I get the error message: "Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]" on the oozie web console. Additionally I've tried using hdfs paths in shell scripts and this does not work as far as I know.

My job.properties file:

nameNode=hdfs://sandbox:8020
jobTracker=hdfs://sandbox:50300   
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozieProjectRoot=${nameNode}/user/sandbox/poc1
appPath=${oozieProjectRoot}/testwf
oozie.wf.application.path=${appPath}

And workflow.xml:

<action name="shell-node">

    <shell xmlns="uri:oozie:shell-action:0.1">

        <job-tracker>${jobTracker}</job-tracker>

        <name-node>${nameNode}</name-node>

        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>

        <exec>${appPath}/hive.sh</exec>

        <file>${appPath}/hive.sh</file>

        <file>${appPath}/hive_pill</file>

    </shell>

    <ok to="end"/>

    <error to="end"/>

</action>

<end name="end"/>

My objective is to use Oozie to call a Hive script through a shell script. Please give your suggestions.

Solution

One thing that has always been tricky about Oozie workflows is the execution of bash scripts. Hadoop is built to be massively parallel, so the architecture behaves very differently than you might expect.

When an Oozie workflow executes a shell action, it receives its resources from your JobTracker or YARN and may run on any node in your cluster. This means that using a local path for your file will not work, since local storage exists only on your edge node. If the job happened to spawn on your edge node it would work, but any other time it would fail, and that placement is effectively random.

To get around this, I found it best to keep the files I needed (including the sh scripts) in HDFS, either in a lib space or in the same location as my workflow.

Here is a good way to approach what you are trying to achieve.

<shell xmlns="uri:oozie:shell-action:0.1">

    <exec>hive.sh</exec> 
    <file>/user/lib/hive.sh#hive.sh</file>
    <file>ETL_file1.hql#hivescript</file>

</shell>

One thing you will notice is that the exec is just hive.sh, since we assume the file will be moved into the working directory where the shell action is executed.

To make sure that last note holds, you must include the file's HDFS path; this forces Oozie to distribute that file with the action. In your case, the Hive script launcher should only be coded once and simply fed different files. Since we have a one-to-many relationship, hive.sh should be kept in a lib and not distributed with every workflow.
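To make the one-to-many pattern concrete, the workflow can pass the .hql file in as a parameter while hive.sh stays in the shared lib. This fragment is a hypothetical sketch: the `hqlFile` parameter name and the `/user/lib` path are illustrative, not part of the original answer.

```xml
<!-- Hypothetical sketch: hive.sh lives once under /user/lib, while each
     workflow supplies its own .hql file via the (illustrative) hqlFile
     parameter. Whatever file is passed, the script always sees it under
     the fixed alias "hivescript". -->
<shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>hive.sh</exec>
    <file>/user/lib/hive.sh#hive.sh</file>
    <file>${hqlFile}#hivescript</file>
</shell>
```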

Lastly you see the line:

<file>ETL_file1.hql#hivescript</file>

This line does two things. Before the # we have the location of the file; it is just the file name here, since we should distribute each workflow's distinct Hive files alongside that workflow:

user/directory/workflow.xml
user/directory/ETL_file1.hql

and the node running the sh will have it distributed to it automagically. Lastly, the part after the # is the name we assign it to inside the sh script. This gives you the ability to reuse the same script over and over and simply feed it different files.
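The `#alias` mechanism behaves like a symlink created in the action's working directory. This local sketch (pure shell in a temp directory, no Hadoop required, file contents invented for illustration) imitates what the distributed cache does, so the renaming is easy to see:

```shell
# Local illustration of what "<file>path#alias</file>" does: the distributed
# cache materializes the file in the task's working directory under the
# alias, so the script can always refer to the fixed name "hivescript"
# no matter which .hql file the workflow shipped.
set -e
workdir=$(mktemp -d)
printf 'CREATE TABLE demo (id INT);\n' > "$workdir/ETL_file1.hql"
# Oozie does the equivalent of this link for ETL_file1.hql#hivescript:
ln -s "$workdir/ETL_file1.hql" "$workdir/hivescript"
cat "$workdir/hivescript"   # the script reads the file through its alias
```

Inside hive.sh, `hive -f hivescript` then works unchanged regardless of which .hql file the workflow distributed.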

HDFS directory notes:

if the file is nested inside the same directory as the workflow, then you only need to specify child paths:

user/directory/workflow.xml
user/directory/hive/ETL_file1.hql

Would yield:

<file>hive/ETL_file1.hql#hivescript</file>

But if the path is outside of the workflow directory you will need the full path:

user/directory/workflow.xml
user/lib/hive.sh

would yield:

<file>/user/lib/hive.sh#hive.sh</file>

I hope this helps everyone.
