HBase Master won't start


Problem Description


I have HBase running in a CDH 5.7.0 cluster. After several months of running without any problems, the HBase service stopped and now it's impossible to start the HBase Master (1 master and 4 region servers).

When I try to start it, at some point the machine hangs, and the last thing I can see in the master log is:

2016-10-24 12:17:15,150 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005528.log
2016-10-24 12:17:15,152 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005528.log after 2ms
2016-10-24 12:17:15,177 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005529.log
2016-10-24 12:17:15,179 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005529.log after 2ms
2016-10-24 12:17:15,394 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005530.log
2016-10-24 12:17:15,397 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005530.log after 3ms
2016-10-24 12:17:15,405 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005531.log
2016-10-24 12:17:15,409 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005531.log after 3ms
2016-10-24 12:17:15,414 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: No buffer space available
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:465)
    at sun.nio.ch.Net.connect(Net.java:457)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:838)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:374)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:742)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
    at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1006)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:969)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:300)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:509)
    at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1175)
    at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1097)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:681)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:187)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1756)
    at java.lang.Thread.run(Thread.java:745)
2016-10-24 12:17:15,427 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /xxx.xxx.xxx.xxx:50010 for block, add to deadNodes and continue. java.net.SocketException: No buffer space available
java.net.SocketException: No buffer space available
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:465)
    at sun.nio.ch.Net.connect(Net.java:457)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:838)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:374)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:742)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
    at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1006)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:969)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:300)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:509)
    at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1175)
    at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1097)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:681)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:187)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1756)
    at java.lang.Thread.run(Thread.java:745)
2016-10-24 12:17:15,436 INFO org.apache.hadoop.hdfs.DFSClient: Successfully connected to /xxx.xxx.xxx.xxx:50010 for BP-813663273-xxx.xxx.xxx.xxx-1460963038761:blk_1079056868_5316127
2016-10-24 12:17:15,442 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005532.log
2016-10-24 12:17:15,444 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005532.log after 2ms
2016-10-24 12:17:15,669 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005533.log
2016-10-24 12:17:15,672 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005533.log after 2ms 

I'm afraid something in the WALProcedureStore is corrupted, but I don't know how to keep digging to find the problem. Any clues? Can I start the Master fresh without trying to load the previous corrupted state?

EDIT:

I just saw this bug, which I think is the same issue that is happening to me. Can I safely remove everything in /hbase/MasterProcWALs without deleting old data stored in HBase?


Thanks

Solution

The WAL, or Write-Ahead Log, is an HBase mechanism for recovering data modifications when everything crashes. Basically, every write operation to HBase is logged to the WAL beforehand, and if the system crashes while the data has not yet been persisted, HBase can recreate those writes from the WAL.

This article helped me better understand the whole process:

The WAL is the lifeline that is needed when disaster strikes. Similar to a BIN log in MySQL it records all changes to the data. This is important in case something happens to the primary storage. So if the server crashes it can effectively replay that log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails the whole operation must be considered a failure.

Let's look at the high-level view of how this is done in HBase. First the client initiates an action that modifies data. This is currently a call to put(Put), delete(Delete) and incrementColumnValue() (abbreviated as "incr" here at times). Each of these modifications is wrapped into a KeyValue object instance and sent over the wire using RPC calls. The calls are (ideally batched) to the HRegionServer that serves the affected regions. Once it arrives, the payload, the said KeyValue, is routed to the HRegion that is responsible for the affected row. The data is written to the WAL and then put into the MemStore of the actual Store that holds the record. And that also pretty much describes the write-path of HBase.

Eventually when the MemStore gets to a certain size or after a specific time the data is asynchronously persisted to the file system. In between that timeframe data is stored volatile in memory. And if the HRegionServer hosting that memory crashes the data is lost... but for the existence of what is the topic of this post, the WAL!
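
For reference, this is roughly what a client write that goes through that path looks like with the HBase 1.x Java client API. This is only a minimal sketch; the table name "my_table", column family "cf" and the row/values are hypothetical placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("my_table"))) {
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("qualifier"), Bytes.toBytes("value"));
                // SYNC_WAL is the usual default: the region server appends the edit to the
                // WAL before acknowledging the write; SKIP_WAL would trade that safety for speed.
                put.setDurability(Durability.SYNC_WAL);
                table.put(put);
            }
        }
    }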

The problem here was that, because of this bug (HBASE-14712), the WAL ended up having thousands of logs. Each time the master tried to become active it had that many different logs to recover the lease on and read... which ended up DDoSing the NameNode. It ran out of TCP buffer space and everything crashed.
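
A quick way to confirm you are in this situation is to count the procedure WAL files. Below is a small sketch using the Hadoop FileSystem API; the NameNode URI and the /hbase root directory are taken from the log lines above and should be adjusted to your cluster:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CountMasterProcWals {
        public static void main(String[] args) throws Exception {
            // Connect to the same NameNode that the master logs show.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            // List the procedure-store WAL files the master replays at startup.
            FileStatus[] logs = fs.listStatus(new Path("/hbase/MasterProcWALs"));
            System.out.println("Procedure WAL files: " + logs.length);
            // Thousands of state-*.log files here is a strong hint that HBASE-14712 is in play.
        }
    }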

To be able to start the master I had to manually remove the logs under /hbase/MasterProcWALs and /hbase/WALs. After doing this the master was able to become active and the HBase cluster went back online.

EDIT:

As Ankit Singhai pointed out, removing the logs in /hbase/WALs will result in data loss. Removing just the logs in /hbase/MasterProcWALs should be fine.
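
For completeness, here is a sketch of that cleanup with the Hadoop FileSystem API (stop the master first; the NameNode URI is the one from the logs above and should be adjusted to your cluster). It deliberately touches only /hbase/MasterProcWALs:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CleanMasterProcWals {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
            // Delete only the procedure-store logs; /hbase/WALs holds region server
            // edits and removing those would mean data loss, as noted above.
            for (FileStatus f : fs.listStatus(new Path("/hbase/MasterProcWALs"))) {
                fs.delete(f.getPath(), false); // each entry is a single state-*.log file
            }
        }
    }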

