Why hive_staging file is missing in AWS EMR


      Problem -

      I am running a query in AWS EMR. It is failing with the following exception -

      java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.
      

      I have mentioned all the information related to this problem below. Please check.

      Query -

      INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
      SELECT 
      *
       FROM 
      (
      select
      commerce_feed_redshift_dedup.sku AS sku,
      commerce_feed_redshift_dedup.revenue AS revenue,
      commerce_feed_redshift_dedup.orders AS orders,
      commerce_feed_redshift_dedup.units AS units,
      commerce_feed_redshift_dedup.feed_date AS feed_date
      from commerce_feed_redshift_dedup
      ) tb
      

      Exception -

      ERROR Error while executing queries
      java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1474097800415_0311_2_00, diagnostics=[Vertex vertex_1474097800415_0311_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: commerce_feed_redshift_dedup initializer failed, vertex=vertex_1474097800415_0311_2_00 [Map 1], java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.
          at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:987)
          at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:929)
          at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:339)
          at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1530)
          at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
          at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1556)
          at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1601)
          at org.apache.hadoop.fs.FileSystem$4.(FileSystem.java:1778)
          at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1777)
          at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1755)
          at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:239)
          at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
          at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:363)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:486)
          at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:200)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)
      ]Vertex killed, vertexName=Reducer 2, vertexId=vertex_1474097800415_0311_2_01, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1474097800415_0311_2_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
          at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:348)
          at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:251)
          at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesInternal(HiveQueryExecutor.java:234)
          at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesMetricsEnabled(HiveQueryExecutor.java:184)
          at com.XXX.YYY.azkaban.jobexecutors.impl.AzkabanHiveQueryExecutor.run(AzkabanHiveQueryExecutor.java:68)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at azkaban.jobtype.JavaJobRunnerMain.runMethod(JavaJobRunnerMain.java:192)
          at azkaban.jobtype.JavaJobRunnerMain.(JavaJobRunnerMain.java:132)
          at azkaban.jobtype.JavaJobRunnerMain.main(JavaJobRunnerMain.java:76)
      

      Hive configuration properties that I set before executing the above query -

      set hivevar:hive.mapjoin.smalltable.filesize=2000000000
      set hivevar:mapreduce.map.speculative=false
      set hivevar:mapreduce.output.fileoutputformat.compress=true
      set hivevar:hive.exec.compress.output=true
      set hivevar:mapreduce.task.timeout=6000000
      set hivevar:hive.optimize.bucketmapjoin.sortedmerge=true
      set hivevar:io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
      set hivevar:hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
      set hivevar:hive.auto.convert.sortmerge.join.noconditionaltask=false
      set hivevar:FEED_DATE=20160917
      set hivevar:hive.optimize.bucketmapjoin=true
      set hivevar:hive.exec.compress.intermediate=true
      set hivevar:hive.enforce.bucketmapjoin=true
      set hivevar:mapred.output.compress=true
      set hivevar:mapreduce.map.output.compress=true
      set hivevar:hive.auto.convert.sortmerge.join=false
      set hivevar:hive.auto.convert.join=false
      set hivevar:mapreduce.reduce.speculative=false
      set hivevar:PD_KEY=vijay-test-mail@XXX.pagerduty.com
      set hivevar:mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
      set hive.mapjoin.smalltable.filesize=2000000000
      set mapreduce.map.speculative=false
      set mapreduce.output.fileoutputformat.compress=true
      set hive.exec.compress.output=true
      set mapreduce.task.timeout=6000000
      set hive.optimize.bucketmapjoin.sortedmerge=true
      set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
      set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
      set hive.auto.convert.sortmerge.join.noconditionaltask=false
      set FEED_DATE=20160917
      set hive.optimize.bucketmapjoin=true
      set hive.exec.compress.intermediate=true
      set hive.enforce.bucketmapjoin=true 
      set mapred.output.compress=true 
      set mapreduce.map.output.compress=true 
      set hive.auto.convert.sortmerge.join=false 
      set hive.auto.convert.join=false 
      set mapreduce.reduce.speculative=false 
      set PD_KEY=vijay-test-mail@XXX.pagerduty.com 
      set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
      

      /etc/hive/conf/hive-site.xml

      <configuration>
      
      <!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
      <!-- that are implied by Hadoop setup variables.                                                -->
      <!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
      <!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
      <!-- resource).                                                                                 -->
      
      <!-- Hive Execution Parameters -->
      
      
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>ip-172-30-2-16.us-west-2.compute.internal</value>
        <description>http://wiki.apache.org/hadoop/Hive/HBaseIntegration</description>
      </property>
      
      <property>
        <name>hive.execution.engine</name>
        <value>tez</value>
      </property>
      
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://ip-172-30-2-16.us-west-2.compute.internal:8020</value>
        </property>
      
      
        <property>
          <name>hive.metastore.uris</name>
          <value>thrift://ip-172-30-2-16.us-west-2.compute.internal:9083</value>
          <description>JDBC connect string for a JDBC metastore</description>
        </property>
      
        <property>
          <name>javax.jdo.option.ConnectionURL</name>
          <value>jdbc:mysql://ip-172-30-2-16.us-west-2.compute.internal:3306/hive?createDatabaseIfNotExist=true</value>
          <description>username to use against metastore database</description>
        </property>
      
        <property>
          <name>javax.jdo.option.ConnectionDriverName</name>
          <value>org.mariadb.jdbc.Driver</value>
          <description>username to use against metastore database</description>
        </property>
      
        <property>
          <name>javax.jdo.option.ConnectionUserName</name>
          <value>hive</value>
          <description>username to use against metastore database</description>
        </property>
      
        <property>
          <name>javax.jdo.option.ConnectionPassword</name>
          <value>mrN949zY9P2riCeY</value>
          <description>password to use against metastore database</description>
        </property>
      
        <property>
          <name>datanucleus.fixedDatastore</name>
          <value>true</value>
        </property>
      
        <property>
          <name>mapred.reduce.tasks</name>
          <value>-1</value>
        </property>
      
        <property>
          <name>mapred.max.split.size</name>
          <value>256000000</value>
        </property>
      
        <property>
          <name>hive.metastore.connect.retries</name>
          <value>15</value>
        </property>
      
        <property>
          <name>hive.optimize.sort.dynamic.partition</name>
          <value>true</value>
        </property>
      
        <property>
          <name>hive.async.log.enabled</name>
          <value>false</value>
        </property>
      
      </configuration>
      

      /etc/tez/conf/tez-site.xml

      <configuration>
          <property>
          <name>tez.lib.uris</name>
          <value>hdfs:///apps/tez/tez.tar.gz</value>
        </property>
      
        <property>
          <name>tez.use.cluster.hadoop-libs</name>
          <value>true</value>
        </property>
      
        <property>
          <name>tez.am.grouping.max-size</name>
          <value>134217728</value>
        </property>
      
        <property>
          <name>tez.runtime.intermediate-output.should-compress</name>
          <value>true</value>
        </property>
      
        <property>
          <name>tez.runtime.intermediate-input.is-compressed</name>
          <value>true</value>
        </property>
      
        <property>
          <name>tez.runtime.intermediate-output.compress.codec</name>
          <value>org.apache.hadoop.io.compress.LzoCodec</value>
        </property>
      
        <property>
          <name>tez.runtime.intermediate-input.compress.codec</name>
          <value>org.apache.hadoop.io.compress.LzoCodec</value>
        </property>
      
        <property>
          <name>tez.history.logging.service.class</name>
          <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
        </property>
      
        <property>
          <name>tez.tez-ui.history-url.base</name>
          <value>http://ip-172-30-2-16.us-west-2.compute.internal:8080/tez-ui/</value>
        </property>
      </configuration>
      

      Questions -

      1. Which process deleted this file? As far as Hive is concerned, this file should still be there. (Also, this file is not created by application code.)
      2. When I re-ran the failed query a number of times, it passed. Why is the behaviour so ambiguous?
      3. Since I just upgraded the hive-exec and hive-jdbc versions to 2.1.0, it seems like some Hive configuration properties are wrongly set or some properties are missing. Can you help me find the wrongly set/missing Hive properties?

      Note - I upgraded the hive-exec version from 0.13.0 to 2.1.0. With the previous version, all queries were working fine.

      Update-1

      When I launched another cluster, it worked fine. I tested the same ETL 3 times.

      When I did the same thing again on a new cluster, it showed the same exception. I am not able to understand why this ambiguity is happening.

      Help me to understand this ambiguity.

      I am new to dealing with Hive, so I have only a limited conceptual understanding of it.

      Update-2 -

      HDFS logs under Cluster Public DNS Name:50070 -

      2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy 2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy (IPC Server handler 11 on 8020): Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}) 2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]} 2016-09-20 11:31:55,155 INFO org.apache.hadoop.ipc.Server (IPC Server handler 11 on 8020): IPC Server handler 11 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 172.30.2.207:56462 Call#7497 Retry#0 java.io.IOException: File /user/hive/warehouse/bc_kmart_3813.db/dp_internal_temp_full_load_offer_flexibility_20160920/.hive-staging_hive_2016-09-20_11-17-51_558_1222354063413369813-58/_task_tmp.-ext-10000/_tmp.000079_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1547) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:724) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

      When I searched for this exception, I found this page - https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo

      In my cluster, there is one data node with 32 GB disk space.

      /etc/hive/conf/hive-default.xml.template -

      <property>
          <name>hive.exec.stagingdir</name>
          <value>.hive-staging</value>
          <description>Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.</description>
        </property>
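
      To confirm which value is actually in effect for the session, the property can be printed from the Hive CLI or Beeline. This is just a quick check; the sample output in the comment is the documented default, not output captured from this cluster:

        -- Print the current value of the staging directory property
        SET hive.exec.stagingdir;
        -- expected output: hive.exec.stagingdir=.hive-staging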
      

      Questions -

      1. As per the logs (e.g. /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-30-2-189.log), the hive-staging folder is created on the cluster machine, so why is the same folder also being created in S3?

      Update-3 -

      Some of the exceptions are of type LeaseExpiredException -

      2016-09-21 08:53:17,995 INFO org.apache.hadoop.ipc.Server (IPC Server handler 13 on 8020): IPC Server handler 13 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 172.30.2.189:42958 Call#726 Retry#0: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive/hadoop/_tez_session_dir/6ebd2d18-f5b9-4176-ab8f-d6c78124b636/.tez/application_1474442135017_0022/recovery/1/summary (inode 20326): File does not exist. Holder DFSClient_NONMAPREDUCE_1375788009_1 does not have any open files.

      Solution

      I resolved the issue. Let me explain in detail.

      The exceptions that were coming -

      1. LeaseExpiredException - from the HDFS side.
      2. FileNotFoundException - from the Hive side (when the Tez execution engine executes the DAG)

      Problem scenario -

      1. We had just upgraded the Hive version from 0.13.0 to 2.1.0, and everything was working fine with the previous version: zero runtime exceptions.

      Different thoughts on resolving the issue -

      1. The first thought was that two threads were working on the same piece of data because of NN intelligence. But as per the settings below

        set mapreduce.map.speculative=false
        set mapreduce.reduce.speculative=false

      that was not possible.

      2. Then, I increased the count from 1000 to 100000 for the settings below -

        SET hive.exec.max.dynamic.partitions=100000;
        SET hive.exec.max.dynamic.partitions.pernode=100000;

      that also didn't work.

      3. Then the third thought was that, within the same process, the file that mapper-1 created was deleted by another mapper/reducer. But we did not find any such logs in the HiveServer2 or Tez logs.

      4. Finally, the root cause lies in the application layer code itself. In the hive-exec-2.1.0 version, they introduced a new configuration property

        "hive.exec.stagingdir":".hive-staging"

      Description of the above property -

      Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.

      So if there are any concurrent jobs in the application layer code (ETL) that are doing operations (rename/delete/move) on the same table, then it may lead to this problem.

      And in our case, 2 concurrent jobs were doing an "INSERT OVERWRITE" on the same table, which led to the deletion of one mapper's metadata file and caused this issue. A sketch of the conflicting pattern is shown below.
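
      The sketch below is illustrative only; the job labels are not from the original ETL. The point is simply that both statements target the same table and overlap in time, so each job creates its own .hive-staging_hive_<timestamp>_... directory under the same S3 table location, and whichever job finishes first rewrites that location and can remove the other job's staging files:

        -- Job A (illustration only; uses the tables from the question)
        INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
        SELECT * FROM commerce_feed_redshift_dedup;

        -- Job B, started by another ETL task before Job A finishes; it stages its
        -- results under the same table location, and the cleanup done by either job
        -- can delete the other's .hive-staging files, causing FileNotFoundException.
        INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
        SELECT * FROM commerce_feed_redshift_dedup;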

      Resolution -

      1. Move the metadata (staging) file location outside the table location (the table lies in S3); see the sketch after this list.
      2. Disable HDFS encryption (as mentioned in the description of the stagingdir property).
      3. Change your application layer code to avoid the concurrency issue.
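
      As a rough sketch of option 1, the staging directory can be pointed at a path outside the table location before running the query. The path and the session-level override below are illustrative only (not taken from the original fix), and how an absolute value is resolved against the table's filesystem can differ between Hive versions, so the resulting location should be verified:

        -- Move the staging directory out of the table location so that concurrent
        -- INSERT OVERWRITE jobs on the same table no longer create and clean up
        -- their .hive-staging directories under the same S3 prefix.
        SET hive.exec.stagingdir=/tmp/hive/staging/.hive-staging;

        INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
        SELECT * FROM commerce_feed_redshift_dedup;

      The same property can instead be set in /etc/hive/conf/hive-site.xml if it should apply to every session rather than a single job.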
