在Hive中创建外部Avro表时,将Sqoop导入为Avro数据文件会将所有值都设置为NULL [英] Sqoop import as Avro data file gives all values as NULL when creating external Avro table in Hive

查看:199
本文介绍了在Hive中创建外部Avro表时,将Sqoop导入为Avro数据文件会将所有值都设置为NULL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试通过Sqoop导入自由格式查询将Oracle DB数据导入HDFS,方法是使用Oemie调度程序使用'-as-avrodatafile'联接两个表.以下是我的工作流程.xml的内容:

I am trying to import Oracle DB data into HDFS using Sqoop import free-form query by joining two tables using '--as-avrodatafile' using Oozie scheduler. Following is the content of my workflow.xml:

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.2" name="sqoop-freeform-wf">
    <start to="sqoop-freeform-node"/>

    <action name="sqoop-freeform-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="/apps/hive/warehouse/loc_avro_import"/>
            </prepare>
            <arg>import</arg>
            <arg>--connect</arg>
            <arg>jdbc:oracle:thin:@connection-string:1521:ORCL</arg>
            <arg>--username</arg>
            <arg>comcast</arg>
            <arg>--password</arg>
            <arg>comcast123</arg>
            <arg>--query</arg>
            <arg>select location.location_id, location.street1,location_meta.display_name from location join location_meta on location.location_id=location_meta.location_id WHERE $CONDITIONS</arg>
            <arg>--target-dir</arg>
            <arg>/apps/hive/warehouse/loc_avro_import</arg>
            <arg>--split-by</arg>
            <arg>location.location_id</arg>
            <arg>--as-avrodatafile</arg>
            <arg>-m</arg>
            <arg>1</arg>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Sqoop free form failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

Oozie作业成功运行,并且还在HDFS上的目录/apps/hive/warehouse/loc_avro_import 下创建了一个Avro文件以及_SUCCESS标志.然后,使用以下Hive脚本在此路径上创建一个外部表:

The Oozie job runs successfully and also creates an Avro file on HDFS under directory /apps/hive/warehouse/loc_avro_import along with _SUCCESS flag. Then I create an external table on this path using following Hive script:

CREATE external TABLE avro_location(LOCATION_ID string, STREET1 string, DISPLAY_NAME string)
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  location
  '/apps/hive/warehouse/loc_avro_import';

该表也已成功创建,但是当我尝试使用Hive shell重试记录时,它返回的行数与在Oracle中执行自由格式查询时返回的行数相同.但是所有行的数据均为NULL.我还尝试使用以下命令查看配置单元的INFO日志:

The table also is created successfully, but when I try to retry the records using Hive shell it returns same number of rows which it returned when I executed free-form query in Oracle. But data comes as NULL for all the rows. I also tried to see the INFO logs of hive using following command:

hive --hiveconf hive.root.logger=INFO,console

下面是我得到的输出:

hive> select * from avro_location;
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=Driver.run from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=TimeToSubmit from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=compile from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=parse from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO parse.ParseDriver: Parsing command: select * from avro_location
15/10/23 15:12:02 [main]: INFO parse.ParseDriver: Parse Completed
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=parse start=1445627522004 end=1445627522004 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Starting Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed phase 1 of Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for source tables
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for subqueries
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Get metadata for destination tables
15/10/23 15:12:02 [main]: INFO ql.Context: New scratch dir is hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed getting MetaData in Semantic Analysis
15/10/23 15:12:02 [main]: INFO parse.BaseSemanticAnalyzer: Not invoking CBO because the statement has too few joins
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: columnComments is 
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: Avro schema is {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO common.FileUtils: Creating directory if it doesn't exist: hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1/-mr-10000/.hive-staging_hive_2015-10-23_15-12-02_004_2341151357389322335-1
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Set stats collection dir : hdfs://sandbox.hortonworks.com:8020/tmp/hive/root/061a4722-0a70-4c28-8b5c-1bf82b63d09f/hive_2015-10-23_15-12-02_004_2341151357389322335-1/-mr-10000/.hive-staging_hive_2015-10-23_15-12-02_004_2341151357389322335-1/-ext-10002
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for FS(2)
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for SEL(1)
15/10/23 15:12:02 [main]: INFO ppd.OpProcFactory: Processing for TS(0)
15/10/23 15:12:02 [main]: INFO parse.CalcitePlanner: Completed plan generation
15/10/23 15:12:02 [main]: INFO ql.Driver: Semantic Analysis Completed
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=semanticAnalyze start=1445627522005 end=1445627522040 duration=35 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: columnComments is 
15/10/23 15:12:02 [main]: INFO avro.AvroSerDe: Avro schema is {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initializing operator TS[0]
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initialization Done 0 TS done is reset.
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Operator 0 TS initialized
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: Initializing children of 0 TS
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing child 1 SEL
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing operator SEL[1]
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: SELECT struct<location_id:string,street1:string,display_name:string>
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initialization Done 1 SEL done is reset.
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Operator 1 SEL initialized
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: Initializing children of 1 SEL
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initializing child 3 OP
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initializing operator OP[3]
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Initialization Done 3 OP done is reset.
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: Operator 3 OP initialized
15/10/23 15:12:02 [main]: INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:avro_location.location_id, type:string, comment:null), FieldSchema(name:avro_location.street1, type:string, comment:null), FieldSchema(name:avro_location.display_name, type:string, comment:null)], properties:null)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=compile start=1445627522003 end=1445627522041 duration=38 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=acquireReadWriteLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO lockmgr.DbTxnManager: Setting lock request transaction to txnid:0 for queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012
15/10/23 15:12:02 [main]: INFO lockmgr.DbLockManager: Requesting: queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012 LockRequest(component:[LockComponent(type:SHARED_READ, level:TABLE, dbname:default, tablename:avro_location)], txnid:0, user:root, hostname:ip-sandbox.hortonworks.com)
15/10/23 15:12:02 [main]: INFO lockmgr.DbLockManager: Response to queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012 LockResponse(lockid:78, state:ACQUIRED)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=acquireReadWriteLocks start=1445627522041 end=1445627522050 duration=9 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=Driver.execute from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO ql.Driver: Starting command(queryId=root_20151023151202_5e68efe1-1176-485b-9014-301c99198012): select * from avro_location
15/10/23 15:12:02 [main]: INFO hooks.ATSHook: Created ATS Hook
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=PreHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1445627522050 end=1445627522050 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=TimeToSubmit start=1445627522003 end=1445627522050 duration=47 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=runTasks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=runTasks start=1445627522051 end=1445627522051 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO hooks.ATSHook: Created ATS Hook
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=PostHook.org.apache.hadoop.hive.ql.hooks.ATSHook from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=PostHook.org.apache.hadoop.hive.ql.hooks.ATSHook start=1445627522051 end=1445627522051 duration=0 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=Driver.execute start=1445627522050 end=1445627522051 duration=1 from=org.apache.hadoop.hive.ql.Driver>
OK
15/10/23 15:12:02 [main]: INFO ql.Driver: OK
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1445627522052 end=1445627522118 duration=66 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=Driver.run start=1445627522003 end=1445627522118 duration=115 from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO mapred.FileInputFormat: Total input paths to process : 1
15/10/23 15:12:02 [main]: INFO avro.AvroGenericRecordReader: Found the avro schema in the job: {"type":"record","name":"avro_location","namespace":"default","fields":[{"name":"location_id","type":["null","string"],"default":null},{"name":"street1","type":["null","string"],"default":null},{"name":"display_name","type":["null","string"],"default":null}]}
15/10/23 15:12:02 [main]: INFO avro.AvroDeserializer: Adding new valid RRID :678ef2a1:150961947b5:-7fff
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
NULL    NULL    NULL
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: 0 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: 1 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: 3 finished. closing... 
15/10/23 15:12:02 [main]: INFO exec.ListSinkOperator: 3 Close done
15/10/23 15:12:02 [main]: INFO exec.SelectOperator: 1 Close done
15/10/23 15:12:02 [main]: INFO exec.TableScanOperator: 0 Close done
Time taken: 0.115 seconds, Fetched: 30 row(s)
15/10/23 15:12:02 [main]: INFO CliDriver: Time taken: 0.115 seconds, Fetched: 30 row(s)
15/10/23 15:12:02 [main]: INFO log.PerfLogger: <PERFLOG method=releaseLocks from=org.apache.hadoop.hive.ql.Driver>
15/10/23 15:12:02 [main]: INFO log.PerfLogger: </PERFLOG method=releaseLocks start=1445627522136 end=1445627522136 duration=0 from=org.apache.hadoop.hive.ql.Driver>

如上所述,它获取30行,但所有值均为NULL.谁能帮我解决这个问题.

As mentioned above it fetches 30 rows but all the values are NULL. Can anyone please help how can I resolve this issue.

推荐答案

我认为您需要为表指定avro模式位置,如下所示: TBLPROPERTIES('avro.schema.url'='/apps/hive/warehouse/loc_avro_import/schemaname.avsc');

I think you need to specify avro schema location for the table like this: TBLPROPERTIES ('avro.schema.url'='/apps/hive/warehouse/loc_avro_import/schemaname.avsc');

这篇关于在Hive中创建外部Avro表时,将Sqoop导入为Avro数据文件会将所有值都设置为NULL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆