Why is the hive_staging file missing in AWS EMR?
Problem -
I am running a query in AWS EMR. It is failing with this exception -
java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.
I have included all the related information for this problem below. Please check.
Query -
INSERT OVERWRITE TABLE base_performance_order_dedup_20160917
SELECT
    *
FROM
(
    SELECT
        commerce_feed_redshift_dedup.sku AS sku,
        commerce_feed_redshift_dedup.revenue AS revenue,
        commerce_feed_redshift_dedup.orders AS orders,
        commerce_feed_redshift_dedup.units AS units,
        commerce_feed_redshift_dedup.feed_date AS feed_date
    FROM commerce_feed_redshift_dedup
) tb
Exception -
ERROR Error while executing queries
java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1474097800415_0311_2_00, diagnostics=[Vertex vertex_1474097800415_0311_2_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: commerce_feed_redshift_dedup initializer failed, vertex=vertex_1474097800415_0311_2_00 [Map 1], java.io.FileNotFoundException: File s3://xxx/yyy/internal_test_automation/2016/09/17/17156/data/feed/commerce_feed_redshift_dedup/.hive-staging_hive_2016-09-17_10-24-20_998_2833938482542362802-639 does not exist.
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:987)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:929)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:339)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1530)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1556)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1601)
    at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:1778)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1777)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1755)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:239)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:281)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:363)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:486)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:200)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
]Vertex killed, vertexName=Reducer 2, vertexId=vertex_1474097800415_0311_2_01, diagnostics=[Vertex received Kill in INITED state., Vertex vertex_1474097800415_0311_2_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
    at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:348)
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:251)
    at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesInternal(HiveQueryExecutor.java:234)
    at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesMetricsEnabled(HiveQueryExecutor.java:184)
    at com.XXX.YYY.azkaban.jobexecutors.impl.AzkabanHiveQueryExecutor.run(AzkabanHiveQueryExecutor.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at azkaban.jobtype.JavaJobRunnerMain.runMethod(JavaJobRunnerMain.java:192)
    at azkaban.jobtype.JavaJobRunnerMain.<init>(JavaJobRunnerMain.java:132)
    at azkaban.jobtype.JavaJobRunnerMain.main(JavaJobRunnerMain.java:76)
Hive configuration properties that I set before executing the above query -
set hivevar:hive.mapjoin.smalltable.filesize=2000000000
set hivevar:mapreduce.map.speculative=false
set hivevar:mapreduce.output.fileoutputformat.compress=true
set hivevar:hive.exec.compress.output=true
set hivevar:mapreduce.task.timeout=6000000
set hivevar:hive.optimize.bucketmapjoin.sortedmerge=true
set hivevar:io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
set hivevar:hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
set hivevar:hive.auto.convert.sortmerge.join.noconditionaltask=false
set hivevar:FEED_DATE=20160917
set hivevar:hive.optimize.bucketmapjoin=true
set hivevar:hive.exec.compress.intermediate=true
set hivevar:hive.enforce.bucketmapjoin=true
set hivevar:mapred.output.compress=true
set hivevar:mapreduce.map.output.compress=true
set hivevar:hive.auto.convert.sortmerge.join=false
set hivevar:hive.auto.convert.join=false
set hivevar:mapreduce.reduce.speculative=false
set hivevar:PD_KEY=vijay-test-mail@XXX.pagerduty.com
set hivevar:mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
set hive.mapjoin.smalltable.filesize=2000000000
set mapreduce.map.speculative=false
set mapreduce.output.fileoutputformat.compress=true
set hive.exec.compress.output=true
set mapreduce.task.timeout=6000000
set hive.optimize.bucketmapjoin.sortedmerge=true
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat
set hive.auto.convert.sortmerge.join.noconditionaltask=false
set FEED_DATE=20160917
set hive.optimize.bucketmapjoin=true
set hive.exec.compress.intermediate=true
set hive.enforce.bucketmapjoin=true
set mapred.output.compress=true
set mapreduce.map.output.compress=true
set hive.auto.convert.sortmerge.join=false
set hive.auto.convert.join=false
set mapreduce.reduce.speculative=false
set PD_KEY=vijay-test-mail@XXX.pagerduty.com
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
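A side note on the duplication above: in Hive, set hivevar:name=value defines a substitution variable that only takes effect where a script references ${hivevar:name}, while a plain set name=value changes the session configuration directly, which is why each property appears in both forms. A minimal sketch of the difference (the query below is illustrative):

-- Defines a substitution variable; this alone does not change Hive's runtime configuration.
set hivevar:FEED_DATE=20160917;
-- The variable is expanded textually before the query is compiled:
SELECT * FROM commerce_feed_redshift_dedup WHERE feed_date = '${hivevar:FEED_DATE}';
-- This, by contrast, changes the configuration property itself for the current session:
set mapreduce.map.speculative=false;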
/etc/hive/conf/hive-site.xml
<configuration>
  <!-- Hive Configuration can either be stored in this file or in the hadoop configuration files -->
  <!-- that are implied by Hadoop setup variables. -->
  <!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive -->
  <!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
  <!-- resource). -->
  <!-- Hive Execution Parameters -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>ip-172-30-2-16.us-west-2.compute.internal</value>
    <description>http://wiki.apache.org/hadoop/Hive/HBaseIntegration</description>
  </property>
  <property>
    <name>hive.execution.engine</name>
    <value>tez</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ip-172-30-2-16.us-west-2.compute.internal:8020</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://ip-172-30-2-16.us-west-2.compute.internal:9083</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://ip-172-30-2-16.us-west-2.compute.internal:3306/hive?createDatabaseIfNotExist=true</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.mariadb.jdbc.Driver</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mrN949zY9P2riCeY</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>mapred.max.split.size</name>
    <value>256000000</value>
  </property>
  <property>
    <name>hive.metastore.connect.retries</name>
    <value>15</value>
  </property>
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.async.log.enabled</name>
    <value>false</value>
  </property>
</configuration>
/etc/tez/conf/tez-site.xml
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>hdfs:///apps/tez/tez.tar.gz</value>
  </property>
  <property>
    <name>tez.use.cluster.hadoop-libs</name>
    <value>true</value>
  </property>
  <property>
    <name>tez.am.grouping.max-size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>tez.runtime.intermediate-output.should-compress</name>
    <value>true</value>
  </property>
  <property>
    <name>tez.runtime.intermediate-input.is-compressed</name>
    <value>true</value>
  </property>
  <property>
    <name>tez.runtime.intermediate-output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.LzoCodec</value>
  </property>
  <property>
    <name>tez.runtime.intermediate-input.compress.codec</name>
    <value>org.apache.hadoop.io.compress.LzoCodec</value>
  </property>
  <property>
    <name>tez.history.logging.service.class</name>
    <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
  </property>
  <property>
    <name>tez.tez-ui.history-url.base</name>
    <value>http://ip-172-30-2-16.us-west-2.compute.internal:8080/tez-ui/</value>
  </property>
</configuration>
Questions -
- Which process deleted this file? For Hive, this file should still be there. (Also, this file is not created by application code.)
- When I re-ran the failed query a number of times, it passed. Why is the behaviour inconsistent?
- I just upgraded the hive-exec and hive-jdbc versions to 2.1.0, so it seems some Hive configuration properties are wrongly set or missing. Can you help me find the wrongly set/missing Hive properties?
Note - I upgraded the hive-exec version from 0.13.0 to 2.1.0. In the previous version, all queries worked fine.
Update-1
When I launched another cluster, it worked fine. I tested the same ETL 3 times.
When I did the same thing again on a new cluster, it showed the same exception. I am not able to understand why this inconsistency is happening.
Help me to understand this inconsistency.
I am new to dealing with Hive, so I have little conceptual background on this.
Update-2 -
HDFS logs under Cluster Public DNS Name:50070 -
2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy (IPC Server handler 11 on 8020): Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2016-09-20 11:31:55,155 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy (IPC Server handler 11 on 8020): Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2016-09-20 11:31:55,155 INFO org.apache.hadoop.ipc.Server (IPC Server handler 11 on 8020): IPC Server handler 11 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 172.30.2.207:56462 Call#7497 Retry#0
java.io.IOException: File /user/hive/warehouse/bc_kmart_3813.db/dp_internal_temp_full_load_offer_flexibility_20160920/.hive-staging_hive_2016-09-20_11-17-51_558_1222354063413369813-58/_task_tmp.-ext-10000/_tmp.000079_0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1547)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:724)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
When I searched for this exception, I found this page - https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo
In my cluster, there is one datanode with 32 GB of disk space.
/etc/hive/conf/hive-default.xml.template -
<property>
  <name>hive.exec.stagingdir</name>
  <value>.hive-staging</value>
  <description>Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.</description>
</property>
Questions -
- As per the logs, the hive-staging folder is created on the cluster machine (see /var/log/hadoop-hdfs/hadoop-hdfs-datanode-ip-172-30-2-189.log), so why is the same folder also being created in S3? (See the note below.)
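A note on why that happens, based on the property description above (the paths below are illustrative, not taken from the logs):

-- Hypothetical: the target table's LOCATION is on S3, e.g.
--   CREATE TABLE commerce_feed_redshift_dedup (...) LOCATION 's3://xxx/yyy/.../commerce_feed_redshift_dedup';
-- Because hive.exec.stagingdir names a directory created inside the table location,
-- an INSERT OVERWRITE stages its query results under that same S3 location:
--   s3://xxx/yyy/.../commerce_feed_redshift_dedup/.hive-staging_hive_<timestamp>_<id>/
-- while ${hive.exec.scratchdir} (on HDFS) is still used for other temporary files such as job plans.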
Update-3 -
Some exceptions are of type LeaseExpiredException -
2016-09-21 08:53:17,995 INFO org.apache.hadoop.ipc.Server (IPC Server handler 13 on 8020): IPC Server handler 13 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from 172.30.2.189:42958 Call#726 Retry#0: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive/hadoop/_tez_session_dir/6ebd2d18-f5b9-4176-ab8f-d6c78124b636/.tez/application_1474442135017_0022/recovery/1/summary (inode 20326): File does not exist. Holder DFSClient_NONMAPREDUCE_1375788009_1 does not have any open files.
Solution -
I resolved the issue. Let me explain in detail.
Exceptions that were occurring -
- LeaseExpiredException - from the HDFS side.
- FileNotFoundException - from the Hive side (when the Tez execution engine executes the DAG).
Problem scenario -
- We had just upgraded the Hive version from 0.13.0 to 2.1.0, and everything was working fine with the previous version - zero runtime exceptions.
Different thoughts on resolving the issue -
My first thought was that two threads were working on the same piece of data because of speculative execution. But as per the settings below
set mapreduce.map.speculative=false
set mapreduce.reduce.speculative=false
that was not possible.
Then, I increased the count from 1000 to 100000 for the settings below -
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
That also didn't work.
Then the third thought was that, within the same process, a file created by mapper-1 was being deleted by another mapper/reducer. But we didn't find any such logs in the HiveServer2 or Tez logs.
Finally, the root cause lay in the application layer code itself. In the hive-exec-2.1.0 version, a new configuration property was introduced -
"hive.exec.stagingdir":".hive-staging"
Description of the above property -
Directory name that will be created inside table locations in order to support HDFS encryption. This is replaces ${hive.exec.scratchdir} for query results with the exception of read-only tables. In all cases ${hive.exec.scratchdir} is still used for other temporary files, such as job plans.
So if there are any concurrent jobs in the application layer code (ETL) doing operations (rename/delete/move) on the same table, it may lead to this problem.
In our case, 2 concurrent jobs were doing an "INSERT OVERWRITE" on the same table; that deleted the staging file of one mapper, which is what caused this issue.
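To make the race concrete, here is a hypothetical timeline (the job labels and staging-directory names are illustrative, not taken from our logs):

-- Job A: INSERT OVERWRITE TABLE base_performance_order_dedup_20160917 SELECT ...;
--   stages its output under <table-location>/.hive-staging_hive_...A/
-- Job B: the same INSERT OVERWRITE on the same table, started concurrently;
--   stages its output under <table-location>/.hive-staging_hive_...B/
-- Job B finishes first: it moves its results into place and cleans up under the table
-- location, removing Job A's staging directory along the way.
-- Job A's Tez input initializer then calls listStatus() on its staging directory and fails:
--   java.io.FileNotFoundException: File s3://.../.hive-staging_hive_... does not exist.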
Resolution -
- Move the staging file location outside the table location (the table lies in S3) - see the sketch after this list.
- Disable HDFS encryption (as mentioned in the description of the stagingdir property).
- Change your application layer code to avoid the concurrency issue.
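For the first option, a minimal sketch (the HDFS path below is an assumption for illustration; any location outside the S3 table directory should do, and the handling of an absolute path is worth verifying on your Hive build):

-- Per session, before running the INSERT OVERWRITE:
set hive.exec.stagingdir=/tmp/hive/.hive-staging;
-- Or cluster-wide in /etc/hive/conf/hive-site.xml:
-- <property>
--   <name>hive.exec.stagingdir</name>
--   <value>/tmp/hive/.hive-staging</value>
-- </property>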