Why hive_staging file is missing in AWS EMR
Problem -
I am running a query in AWS EMR. It is failing with the following exception -
java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.
I have included all the related information for this problem below. Please check.
Query -
INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
SELECT
*
FROM
(
select
commerce_feed_redshift_dedup.sku AS sku,
commerce_feed_redshift_dedup.revenue AS revenue,
commerce_feed_redshift_dedup.orders AS orders,
commerce_feed_redshift_dedup.units AS units,
commerce_feed_redshift_dedup.feed_date AS feed_date
from commerce_feed_redshift_dedup
) tb
Exception -
ERROR Error while executing queries
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1474097800415_0311_2_00, diagnostics=[Vertex vertex_1474097800415_0311_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: commerce_feed_redshift_dedup initializer failed, vertex=vertex_1474097800415_0311_2_00 [Map 1], java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:987)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:929)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1530)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1556)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1601)
at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:1778)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1777)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1755)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:239)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:363)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:486)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:200)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
]Vertex killed, vertexName=Reducer 2, vertexId=vertex_1474097800415_0311_2_01, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1474097800415_0311_2_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:348)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:251)
at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesInternal(HiveQueryExecutor.java:234)
at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesMetricsEnabled(HiveQueryExecutor.java:184)
at com.XXX.YYY.azkaban.jobexecutors.impl.AzkabanHiveQueryExecutor.run(AzkabanHiveQueryExecutor.java:68)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at azkaban.jobtype.JavaJobRunnerMain.runMethod(JavaJobRunnerMain.java:192)
at azkaban.jobtype.JavaJobRunnerMain.<init>(JavaJobRunnerMain.java:132)
at azkaban.jobtype.JavaJobRunnerMain.main(JavaJobRunnerMain.java:76)
Hive configuration properties that I set before executing the above query -
set hivevar:hive.mapjoin.smalltable.filesize=2000000000
set hivevar:mapreduce.map.speculative=false
set hivevar:mapreduce.output.fileoutputformat.compress=true
set hivevar:hive.exec.compress.output=true
set hivevar:mapreduce.task.timeout=6000000
set hivevar:hive.optimize.bucketmapjoin.sortedmerge=true
set hivevar:io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
set hivevar:hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
set hivevar:hive.auto.convert.sortmerge.join.noconditionaltask=false
set hivevar:FEED_DATE=20160917
set hivevar:hive.optimize.bucketmapjoin=true
set hivevar:hive.exec.compress.intermediate=true
set hivevar:hive.enforce.bucketmapjoin=true
set hivevar:mapred.output.compress=true
set hivevar:mapreduce.map.output.compress=true
set hivevar:hive.auto.convert.sortmerge.join=false
set hivevar:hive.auto.convert.join=false
set hivevar:mapreduce.reduce.speculative=false
set hivevar:PD_KEY=vijay-test-mail@XXX.pagerduty.com
set hivevar:mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
set hive.mapjoin.smalltable.filesize=2000000000
set mapreduce.map.speculative=false
set mapreduce.output.fileoutputformat.compress=true
set hive.exec.compress.output=true
set mapreduce.task.timeout=6000000
set hive.optimize.bucketmapjoin.sortedmerge=true
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
set hive.auto.convert.sortmerge.join.noconditionaltask=false
set FEED_DATE=20160917
set hive.optimize.bucketmapjoin=true
set hive.exec.compress.intermediate=true
set hive.enforce.bucketmapjoin=true
set mapred.output.compress=true
set mapreduce.map.output.compress=true
set hive.auto.convert.sortmerge.join=false
set hive.auto.convert.join=false
set mapreduce.reduce.speculative=false
set PD_KEY=vijay-test-mail@XXX.pagerduty.com
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
/etc/hive/conf/hive-site.xml
<configuration>
<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files -->
<!-- that are implied by Hadoop setup variables. -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource). -->
<!-- Hive Execution Parameters -->
<property>
<name>hbase.zookeeper.quorum</name>
<value>ip-172-30-2-16.us-west-2.compute.internal</value>
<description>http://wiki.apache.org/hadoop/Hive/HBaseIntegration</description>
</property>
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ip-172-30-2-16.us-west-2.compute.internal:8020</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://ip-172-30-2-16.us-west-2.compute.internal:9083</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://ip-172-30-2-16.us-west-2.compute.internal:3306/hive?createDatabaseIfNotExist=true</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.mariadb.jdbc.Driver</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>mrN949zY9P2riCeY</value>
<description>password to use against metastore database</description>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
</property>
<property>
<name>mapred.max.split.size</name>
<value>256000000</value>
</property>
<property>
<name>hive.metastore.connect.retries</name>
<value>15</value>
</property>
<property>
<name>hive.optimize.sort.dynamic.partition</name>
<value>true</value>
</property>
<property>
<name>hive.async.log.enabled</name>
<value>false</value>
</property>
</configuration>
/etc/tez/conf/tez-site.xml
<configuration>
<property>
<name>tez.lib.uris</name>
<value>hdfs:///apps/tez/tez.tar.gz</value>
</property>
<property>
<name>tez.use.cluster.hadoop-libs</name>
<value>true</value>
</property>
<property>
<name>tez.am.grouping.max-size</name>
<value>134217728</value>
</property>
<property>
<name>tez.runtime.intermediate-output.should-compress</name>
<value>true</value>
</property>
<property>
<name>tez.runtime.intermediate-input.is-compressed</name>
<value>true</value>
</property>
<property>
<name>tez.runtime.intermediate-output.compress.codec</name>
<value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>
<property>
<name>tez.runtime.intermediate-input.compress.codec</name>
<value>org.apache.hadoop.io.compress.LzoCodec</value>
</property>
<property>
<name>tez.history.logging.service.class</name>
<value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
</property>
<property>
<name>tez.tez-ui.history-url.base</name>
<value>http://ip-172-30-2-16.us-west-2.compute.internal:8080/tez-ui/</value>
</property>
</configuration>
Questions -
- Which process deleted this file? For Hive, this file should still be there. (Also, this file is not created by application code.)
- When I reran the failed query a number of times, it passed. Why is there this ambiguous behaviour?
- I just upgraded the hive-exec and hive-jdbc versions to 2.1.0, so it seems that some Hive configuration properties are wrongly set or missing. Can you help me find the wrongly set/missing Hive properties?
Note - I upgraded the hive-exec version from 0.13.0 to 2.1.0. In the previous version, all queries were working fine.
Update-1
When I launched another cluster, it worked fine. I tested the same ETL 3 times.
When I did the same thing again on a new cluster, it showed the same exception. I am not able to understand why this ambiguity is happening.
Help me to understand this ambiguity.
I am new to dealing with Hive, so I have only a limited conceptual understanding of it.
Update-2
HDFS logs under <Cluster Public DNS Name>:50070 -
2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy 2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy (IPC Server handler 11 on 8020): Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}) 2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]} 2016-09-20 11:31:55,155 INFO org.apache.hadoop.ipc.Server (IPC Server handler 11 on 8020): IPC Server handler 11 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 172.30.2.207:56462 Call#7497 Retry#0 java.io.IOException: File /user/hive/warehouse/bc_kmart_3813.db/dp_internal_temp_full_load_offer_flexibility_20160920/.hive-staging_hive_2016-09-20_11-17-51_558_1222354063413369813-58/_task_tmp.-ext-10000/_tmp.000079_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation. 
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1547) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:724) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
When I searched for this exception, I found this page - https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo
In my cluster, there is one data node with 32 GB disk space.
/etc/hive/conf/hive-default.xml.template -
<property>
<name>hive.exec.stagingdir</name>
<value>.hive-staging</value>
<description>Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.</description>
</property>
Questions -
- As per the logs, the hive-staging folder is created on the cluster machine (see /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-30-2-189.log), so why is it creating the same folder in S3 as well?
Update-3
Some of the exceptions are of type LeaseExpiredException -
2016-09-21 08:53:17,995 INFO org.apache.hadoop.ipc.Server (IPC Server handler 13 on 8020): IPC Server handler 13 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from
172.30.2.189:42958 Call#726 Retry#0: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive/hadoop/_tez_session_dir/6ebd2d18-f5b9-4176-ab8f-d6c78124b636/.tez/application_1474442135017_0022/recovery/1/summary (inode 20326): File does not exist. Holder DFSClient_NONMAPREDUCE_1375788009_1 does not have any open files.
I resolved the issue. Let me explain in detail.
Exceptions that were occurring -
- LeaseExpiredException - from the HDFS side.
- FileNotFoundException - from the Hive side (when the Tez execution engine executes the DAG)
Problem scenario -
- We just upgraded the Hive version from 0.13.0 to 2.1.0, and everything was working fine with the previous version. Zero runtime exceptions.
Different thoughts to resolve the issue -
My first thought was that two attempts of the same task were working on the same file (speculative execution). But as per the settings below
set mapreduce.map.speculative=false
set mapreduce.reduce.speculative=false
that was not possible.
Then, I increased the limit from 1000 to 100000 for the settings below -
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
that also didn't work.
Then the third thought was that, within the same process, the file created by mapper-1 was definitely deleted by another mapper/reducer. But we did not find any such logs in the HiveServer2 or Tez logs.
Finally, the root cause lies in the application layer code itself. In the hive-exec-2.1.0 version, a new configuration property was introduced -
"hive.exec.stagingdir":".hive-staging"
Description of the above property -
Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.
So if there are concurrent jobs in the application layer code (ETL) doing operations (rename/delete/move) on the same table, this problem can occur.
And, in our case, 2 concurrent jobs were doing an "INSERT OVERWRITE" on the same table; one of them deleted the staging (metadata) files of the other's mapper, which caused this issue.
Resolution -
- Move the staging file location outside the table location (the table lies in S3).
- Disable HDFS encryption (as mentioned in the description of the stagingdir property).
- Change your application layer code to avoid the concurrency issue.
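For the first option, a minimal sketch of what that can look like - note that the HDFS scratch path below is only an illustrative assumption, not a value from our actual setup:

```sql
-- Sketch (path is a hypothetical example): relocate the staging directory
-- from the default ".hive-staging" inside the (S3) table location to an
-- HDFS scratch area, so concurrent jobs no longer create staging files
-- under the same table prefix.
SET hive.exec.stagingdir=/tmp/hive/staging/.hive-staging;

-- Then run the job as before, e.g. (simplified form of the original query):
INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
SELECT * FROM commerce_feed_redshift_dedup;
```

With this in place, INSERT OVERWRITE on the table no longer risks deleting another job's in-flight staging directory, since nothing transient lives under the table location.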