Why hive_staging file is missing in AWS EMR

Problem Description

Problem -

I am running a query in AWS EMR. It is failing with the following exception -

java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.

I have included all the relevant information for this problem below. Please check.

Query -

INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
SELECT 
*
 FROM 
(
select
commerce_feed_redshift_dedup.sku AS sku,
commerce_feed_redshift_dedup.revenue AS revenue,
commerce_feed_redshift_dedup.orders AS orders,
commerce_feed_redshift_dedup.units AS units,
commerce_feed_redshift_dedup.feed_date AS feed_date
from commerce_feed_redshift_dedup
) tb

Exception -

ERROR Error while executing queries
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1474097800415_0311_2_00, diagnostics=[Vertex vertex_1474097800415_0311_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: commerce_feed_redshift_dedup initializer failed, vertex=vertex_1474097800415_0311_2_00 [Map 1], java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:987)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:929)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:339)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1530)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1556)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1601)
    at org.apache.hadoop.fs.FileSystem$4.(FileSystem.java:1778)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1777)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1755)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:239)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:363)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:486)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:200)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
]Vertex killed, vertexName=Reducer 2, vertexId=vertex_1474097800415_0311_2_01, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1474097800415_0311_2_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
    at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:348)
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:251)
    at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesInternal(HiveQueryExecutor.java:234)
    at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesMetricsEnabled(HiveQueryExecutor.java:184)
    at com.XXX.YYY.azkaban.jobexecutors.impl.AzkabanHiveQueryExecutor.run(AzkabanHiveQueryExecutor.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at azkaban.jobtype.JavaJobRunnerMain.runMethod(JavaJobRunnerMain.java:192)
    at azkaban.jobtype.JavaJobRunnerMain.(JavaJobRunnerMain.java:132)
    at azkaban.jobtype.JavaJobRunnerMain.main(JavaJobRunnerMain.java:76)

Hive configuration properties that I set before executing the above query -

set hivevar:hive.mapjoin.smalltable.filesize=2000000000
set hivevar:mapreduce.map.speculative=false
set hivevar:mapreduce.output.fileoutputformat.compress=true
set hivevar:hive.exec.compress.output=true
set hivevar:mapreduce.task.timeout=6000000
set hivevar:hive.optimize.bucketmapjoin.sortedmerge=true
set hivevar:io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
set hivevar:hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
set hivevar:hive.auto.convert.sortmerge.join.noconditionaltask=false
set hivevar:FEED_DATE=20160917
set hivevar:hive.optimize.bucketmapjoin=true
set hivevar:hive.exec.compress.intermediate=true
set hivevar:hive.enforce.bucketmapjoin=true
set hivevar:mapred.output.compress=true
set hivevar:mapreduce.map.output.compress=true
set hivevar:hive.auto.convert.sortmerge.join=false
set hivevar:hive.auto.convert.join=false
set hivevar:mapreduce.reduce.speculative=false
set hivevar:PD_KEY=vijay-test-mail@XXX.pagerduty.com
set hivevar:mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
set hive.mapjoin.smalltable.filesize=2000000000
set mapreduce.map.speculative=false
set mapreduce.output.fileoutputformat.compress=true
set hive.exec.compress.output=true
set mapreduce.task.timeout=6000000
set hive.optimize.bucketmapjoin.sortedmerge=true
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
set hive.auto.convert.sortmerge.join.noconditionaltask=false
set FEED_DATE=20160917
set hive.optimize.bucketmapjoin=true
set hive.exec.compress.intermediate=true
set hive.enforce.bucketmapjoin=true 
set mapred.output.compress=true 
set mapreduce.map.output.compress=true 
set hive.auto.convert.sortmerge.join=false 
set hive.auto.convert.join=false 
set mapreduce.reduce.speculative=false 
set PD_KEY=vijay-test-mail@XXX.pagerduty.com 
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

/etc/hive/conf/hive-site.xml

<configuration>

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
<!-- that are implied by Hadoop setup variables.                                                -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource).                                                                                 -->

<!-- Hive Execution Parameters -->


<property>
  <name>hbase.zookeeper.quorum</name>
  <value>ip-172-30-2-16.us-west-2.compute.internal</value>
  <description>http://wiki.apache.org/hadoop/Hive/HBaseIntegration</description>
</property>

<property>
  <name>hive.execution.engine</name>
  <value>tez</value>
</property>

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ip-172-30-2-16.us-west-2.compute.internal:8020</value>
  </property>


  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://ip-172-30-2-16.us-west-2.compute.internal:9083</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://ip-172-30-2-16.us-west-2.compute.internal:3306/hive?createDatabaseIfNotExist=true</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.mariadb.jdbc.Driver</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mrN949zY9P2riCeY</value>
    <description>password to use against metastore database</description>
  </property>

  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>

  <property>
    <name>mapred.max.split.size</name>
    <value>256000000</value>
  </property>

  <property>
    <name>hive.metastore.connect.retries</name>
    <value>15</value>
  </property>

  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.async.log.enabled</name>
    <value>false</value>
  </property>

</configuration>

/etc/tez/conf/tez-site.xml

<configuration>
    <property>
    <name>tez.lib.uris</name>
    <value>hdfs:///apps/tez/tez.tar.gz</value>
  </property>

  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
  </property>

  <property>
    <name>tez.am.grouping.max-size</name>
    <value>134217728</value>
  </property>

  <property>
    <name>tez.runtime.intermediate-output.should-compress</name>
    <value>true</value>
  </property>

  <property>
    <name>tez.runtime.intermediate-input.is-compressed</name>
    <value>true</value>
  </property>

  <property>
    <name>tez.runtime.intermediate-output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.LzoCodec</value>
  </property>

  <property>
    <name>tez.runtime.intermediate-input.compress.codec</name>
    <value>org.apache.hadoop.io.compress.LzoCodec</value>
  </property>

  <property>
    <name>tez.history.logging.service.class</name>
    <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
  </property>

  <property>
    <name>tez.tez-ui.history-url.base</name>
    <value>http://ip-172-30-2-16.us-west-2.compute.internal:8080/tez-ui/</value>
  </property>
</configuration>

Questions -

  1. Which process deleted this file? As far as Hive is concerned, this file should still be there. (Also, this file is not created by application code.)
  2. When I re-ran the failed query a number of times, it passed. Why is the behaviour inconsistent?
  3. I just upgraded the hive-exec and hive-jdbc versions to 2.1.0, so it seems some Hive configuration properties are set wrongly or are missing. Can you help me find the wrongly set/missing Hive properties?

Note - I upgraded the hive-exec version from 0.13.0 to 2.1.0. With the previous version, all queries were working fine.

Update-1

When I launched another cluster, it worked fine. I tested the same ETL 3 times.

When I did the same thing again on a new cluster, it showed the same exception. I am not able to understand why this inconsistency is happening.

Please help me understand this inconsistency.

I am new to dealing with Hive, so I have only a limited conceptual understanding of it.

Update-2

HDFS logs under Cluster Public DNS Name:50070 -

2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy 2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy (IPC Server handler 11 on 8020): Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}) 2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]} 2016-09-20 11:31:55,155 INFO org.apache.hadoop.ipc.Server (IPC Server handler 11 on 8020): IPC Server handler 11 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 172.30.2.207:56462 Call#7497 Retry#0 java.io.IOException: File /user/hive/warehouse/bc_kmart_3813.db/dp_internal_temp_full_load_offer_flexibility_20160920/.hive-staging_hive_2016-09-20_11-17-51_558_1222354063413369813-58/_task_tmp.-ext-10000/_tmp.000079_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1547) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:724) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)

When I searched for this exception, I found this page - https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo

In my cluster, there is one data node with 32 GB of disk space.

/etc/hive/conf/hive-default.xml.template -

<property>
    <name>hive.exec.stagingdir</name>
    <value>.hive-staging</value>
    <description>Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.</description>
  </property>
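
For reference, the staging directory named in the exception sits directly inside the table's S3 location, which is exactly what the property description above says (the paths below are copied and trimmed from the error message, not new examples):

    table location : s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/
    staging dir    : s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639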

Questions -

  1. As per the logs in /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-30-2-189.log, the hive-staging folder is created on the cluster machine, so why is it also creating the same folder in S3?

Update-3

Some exceptions are of type LeaseExpiredException -

2016-09-21 08:53:17,995 INFO org.apache.hadoop.ipc.Server (IPC Server handler 13 on 8020): IPC Server handler 13 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 172.30.2.189:42958 Call#726 Retry#0: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive/hadoop/_tez_session_dir/6ebd2d18-f5b9-4176-ab8f-d6c78124b636/.tez/application_1474442135017_0022/recovery/1/summary (inode 20326): File does not exist. Holder DFSClient_NONMAPREDUCE_1375788009_1 does not have any open files.

Solution

I resolved the issue. Let me explain in detail.

Exceptions that were occurring -

  1. LeaseExpiredException - from the HDFS side.
  2. FileNotFoundException - from the Hive side (when the Tez execution engine executes the DAG).

Problem scenario -

  1. We had just upgraded the Hive version from 0.13.0 to 2.1.0, and everything was working fine with the previous version. Zero runtime exceptions.

Different thoughts on resolving the issue -

  1. The first thought was that two threads were working on the same piece of data because of NN intelligence. But as per the settings below,

    set mapreduce.map.speculative=false
    set mapreduce.reduce.speculative=false

that was not possible.

  2. Then, I increased the count from 1000 to 100000 for the settings below -

    SET hive.exec.max.dynamic.partitions=100000;
    SET hive.exec.max.dynamic.partitions.pernode=100000;

that also didn't work.

  3. Then the third thought was that, within the same process, something created by mapper 1 was being deleted by another mapper/reducer. But we didn't find any such logs in the HiveServer2 or Tez logs.

  4. Finally, the root cause lies in the application layer code itself. In hive-exec version 2.1.0, they introduced a new configuration property:

    "hive.exec.stagingdir":".hive-staging"

Description of the above property -

Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.

So if there are any concurrent jobs in the application layer code (ETL) that perform operations (rename/delete/move) on the same table, it may lead to this problem.

And in our case, 2 concurrent jobs were doing "INSERT OVERWRITE" on the same table, which led to the staging (metadata) file of one mapper being deleted and caused this issue.
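
A minimal sketch of one interleaving that is consistent with the exception above (the table names are taken from the post, the upstream source in Run A is a hypothetical placeholder, and the exact ordering is my reconstruction, not something visible in the logs):

    -- Run A: a concurrent ETL job rewrites the dedup table.
    INSERT OVERWRITE TABLE commerce_feed_redshift_dedup
    SELECT sku, revenue, orders, units, feed_date
    FROM upstream_commerce_feed;                 -- hypothetical source table

    -- Run B: the failing query from the post, reading the same table at the same time.
    INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
    SELECT sku, revenue, orders, units, feed_date
    FROM commerce_feed_redshift_dedup;

    -- With hive.exec.stagingdir=.hive-staging, Run A creates its
    -- .hive-staging_hive_<timestamp>_<id> directory inside
    -- s3://.../commerce_feed_redshift_dedup/. Run B lists that location while
    -- computing input splits (the FileInputFormat.listStatus frames in the stack
    -- trace) and records the staging entry; when Run A commits and removes its
    -- staging directory, Run B fails with java.io.FileNotFoundException.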

Resolution -

  1. Move the staging (metadata) file location outside the table location (the table lies in S3), as shown in the sketch after this list.
  2. Disable HDFS encryption (as mentioned in the description of the stagingdir property).
  3. Change your application layer code to avoid the concurrency issue.
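
A minimal sketch of option 1, assuming the Hive 2.1.0 build in use accepts an absolute path for hive.exec.stagingdir (the path below is a placeholder, not from the original setup); the same value can equally be put in hive-site.xml:

    -- Per-session override, e.g. at the top of the ETL script next to the other SET statements:
    SET hive.exec.stagingdir=/tmp/hive-staging;   -- placeholder scratch path outside the table location

    -- With this, INSERT OVERWRITE builds its .hive-staging_hive_* directories under
    -- /tmp/hive-staging instead of inside the table's S3 prefix, so concurrent jobs
    -- listing the table location no longer see (or delete) each other's staging directories.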
