Hive query execution for custom UDF is expecting HDFS jar path instead of local path in CDH4 with Oozie flow


Problem description


We are migrating from CDH3 to CDH4, and as part of this migration we are moving all the jobs that we have on CDH3. We have noticed one critical issue: a workflow is executed through Oozie to run a Python script, which internally invokes a Hive query (hive -e {query}). In this Hive query we add a custom jar using add jar {LOCAL PATH FOR JAR} and create a temporary function for the custom UDF. Everything looks fine up to this point. But when the query starts executing with the custom UDF function, it fails with a distributed cache File Not Found exception, because it is looking for the jar in the HDFS path instead of the local path.
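For context, the failing sequence looks roughly like this (the jar path, function name, class name, and table are all placeholders, not the actual ones from the job):

```sql
-- Illustrative sketch only: paths and names below are placeholders.
ADD JAR /local/path/to/custom.jar;
CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF';
-- This step fails under CDH4/Oozie: the distributed cache looks for
-- the jar under hdfs:// instead of the local filesystem path.
SELECT my_udf(col) FROM some_table;
```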


I am not sure if I am missing some configuration here.

Execution trace:


WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated. Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
Execution log at: /tmp/yarn/yarn_20131107020505_79b41443-b9f4-4d36-a0eb-4f0d79cd3ce9.log
java.io.FileNotFoundException: File does not exist: hdfs://aa.bb.com:8020/opt/nfsmount/mypath/custom.jar
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:824)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
    at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
    .....
    .....


Any help on this is highly appreciated.

Regards, GHK.

Recommended answer


There are a few options. All the required jars should be on the classpath before you run the Hive query.


Option 1: Add your custom jar via <file>/hdfs/path/to/your/jar</file> in the Oozie workflow.
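As a sketch, the `<file>` element goes inside the workflow action. The action name, script name, and jar path below are placeholders; adapt them to your workflow.xml:

```xml
<!-- Hypothetical shell action in workflow.xml; all paths and names
     are placeholders. Oozie localizes each <file> from HDFS into the
     action's working directory, making the jar available locally. -->
<action name="run-hive-script">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>run_query.py</exec>
        <file>run_query.py</file>
        <file>/hdfs/path/to/your/custom.jar#custom.jar</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
</action>
```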


Option 2: Use the --auxpath /local/path/to/your/jar option when invoking Hive from your Python script. E.g.: hive --auxpath /local/path/to/your.jar -e {query}
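Since the question describes a Python script invoking Hive, option 2 might be wired up like this. The jar path and query are placeholders, and the actual hive call is commented out because it needs a working Hive installation:

```python
import subprocess

# Placeholder values -- substitute your own jar location and query.
jar_path = "/local/path/to/custom.jar"
query = "SELECT my_udf(col) FROM some_table"

# --auxpath puts the jar on Hive's classpath before the query runs,
# so the temporary UDF can be resolved without an HDFS lookup.
cmd = ["hive", "--auxpath", jar_path, "-e", query]

# Uncomment to actually run the query:
# exit_status = subprocess.call(cmd)
print(" ".join(cmd))
```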

