HBase Master won't start


Problem description

I have HBase running in a CDH cluster 5.7.0. After several months running without any problems, hbase service stopped and now it's impossible to start the HBase master (1 master and 4 region servers).

When I try to start it, at some point the machine hangs and the last thing I can see in the master log is:

2016-10-24 12:17:15,150 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005528.log
2016-10-24 12:17:15,152 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005528.log after 2ms
2016-10-24 12:17:15,177 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005529.log
2016-10-24 12:17:15,179 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005529.log after 2ms
2016-10-24 12:17:15,394 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005530.log
2016-10-24 12:17:15,397 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005530.log after 3ms
2016-10-24 12:17:15,405 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005531.log
2016-10-24 12:17:15,409 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005531.log after 3ms
2016-10-24 12:17:15,414 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.net.SocketException: No buffer space available
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:465)
    at sun.nio.ch.Net.connect(Net.java:457)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:838)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:374)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:742)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
    at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1006)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:969)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:300)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:509)
    at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1175)
    at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1097)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:681)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:187)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1756)
    at java.lang.Thread.run(Thread.java:745)
2016-10-24 12:17:15,427 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /xxx.xxx.xxx.xxx:50010 for block, add to deadNodes and continue. java.net.SocketException: No buffer space available
java.net.SocketException: No buffer space available
    at sun.nio.ch.Net.connect0(Native Method)
    at sun.nio.ch.Net.connect(Net.java:465)
    at sun.nio.ch.Net.connect(Net.java:457)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3499)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:838)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:753)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:374)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:662)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:889)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:942)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:742)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
    at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
    at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
    at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1006)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:969)
    at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:300)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:509)
    at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1175)
    at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1097)
    at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:681)
    at org.apache.hadoop.hbase.master.HMaster.access$500(HMaster.java:187)
    at org.apache.hadoop.hbase.master.HMaster$1.run(HMaster.java:1756)
    at java.lang.Thread.run(Thread.java:745)
2016-10-24 12:17:15,436 INFO org.apache.hadoop.hdfs.DFSClient: Successfully connected to /xxx.xxx.xxx.xxx:50010 for BP-813663273-xxx.xxx.xxx.xxx-1460963038761:blk_1079056868_5316127
2016-10-24 12:17:15,442 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005532.log
2016-10-24 12:17:15,444 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005532.log after 2ms
2016-10-24 12:17:15,669 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recover lease on dfs file hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005533.log
2016-10-24 12:17:15,672 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovered lease, attempt=0 on file=hdfs://namenode:8020/hbase/MasterProcWALs/state-00000000000000005533.log after 2ms 

I'm afraid there's something corrupted in the WALProcedureStore, but I don't know how to keep digging to find the problem. Any clues? Can I start the master fresh, without trying to load the previous corrupted state?

EDIT:

I just saw this bug, which I think is the same issue that is happening to me. Can I safely remove everything in /hbase/MasterProcWALs without deleting old data stored in HBase?

Thanks

Recommended answer

The WAL, or Write-Ahead Log, is the HBase mechanism for recovering data modifications when everything crashes. Basically, every write operation to HBase is logged to the WAL beforehand, and if the system crashes before the data is persisted, HBase can recreate those writes from the WAL.
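To make that concrete, here is a minimal sketch of an HBase 1.x client write (CDH 5.7 ships HBase 1.2). The table name, column family and values are hypothetical, and the durability is set explicitly only to highlight that, by default, an edit is appended to the WAL before it is acknowledged:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Durability;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalWriteSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("my_table"))) {   // hypothetical table
                Put put = new Put(Bytes.toBytes("row-1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
                // SYNC_WAL is the default behaviour: the edit is written to the WAL
                // before the call returns, so it can be replayed after a crash.
                put.setDurability(Durability.SYNC_WAL);
                table.put(put);
            }
        }
    }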

This article helped me to better understand the whole process:

The WAL is the lifeline that is needed when disaster strikes. Similar to a BIN log in MySQL it records all changes to the data. This is important in case something happens to the primary storage. So if the server crashes it can effectively replay that log to get everything up to where the server should have been just before the crash. It also means that if writing the record to the WAL fails the whole operation must be considered a failure.

Let"s look at the high level view of how this is done in HBase. First the client initiates an action that modifies data. This is currently a call to put(Put), delete(Delete) and incrementColumnValue() (abbreviated as "incr" here at times). Each of these modifications is wrapped into a KeyValue object instance and sent over the wire using RPC calls. The calls are (ideally batched) to the HRegionServer that serves the affected regions. Once it arrives the payload, the said KeyValue, is routed to the HRegion that is responsible for the affected row. The data is written to the WAL and then put into the MemStore of the actual Store that holds the record. And that also pretty much describes the write-path of HBase.

Eventually when the MemStore gets to a certain size or after a specific time the data is asynchronously persisted to the file system. In between that timeframe data is stored volatile in memory. And if the HRegionServer hosting that memory crashes the data is lost... but for the existence of what is the topic of this post, the WAL!
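As an aside, a flush can also be requested explicitly through the admin API. The sketch below assumes a hypothetical table named "my_table" and is only meant to illustrate the point at which MemStore contents are written out to the filesystem; until then, the edits exist only in the MemStore and the WAL:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class FlushSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // Ask the region servers to flush the table's MemStores to disk now,
                // instead of waiting for the automatic size- or time-based flush.
                admin.flush(TableName.valueOf("my_table"));   // hypothetical table
            }
        }
    }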

The problem here was that, because of this bug (HBASE-14712), the master procedure WAL ended up containing thousands of log files. Each time the master tried to become active it had that many logs to recover the lease on and read, which ended up effectively DDoSing the NameNode. It ran out of TCP buffer space and everything crashed.
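A quick way to confirm this symptom is to count how many procedure WAL files have piled up. A minimal sketch using the Hadoop FileSystem API (the NameNode URI is taken from the logs above; adjust it for your cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CountProcWals {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");   // NameNode URI as in the master log
            try (FileSystem fs = FileSystem.get(conf)) {
                FileStatus[] files = fs.listStatus(new Path("/hbase/MasterProcWALs"));
                // A healthy master normally keeps only a handful of these files;
                // thousands of them point at HBASE-14712.
                System.out.println("MasterProcWALs files: " + files.length);
            }
        }
    }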

To be able to start the master I had to manually remove the logs under /hbase/MasterProcWALs and /hbase/WALs. After doing this the master was able to become active and the HBase cluster went back online.

As Ankit Singhai pointed out, removing the logs in /hbase/WALs will result in data loss. Removing just the logs in /hbase/MasterProcWALs should be fine.
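For reference, here is a hedged sketch of that cleanup using the same FileSystem API. It moves the procedure WALs aside instead of deleting them outright (the backup path is hypothetical, and this assumes HBase is stopped while you do it). It deliberately does not touch /hbase/WALs, since removing those would mean data loss:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MoveProcWalsAside {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");   // NameNode URI as in the master log
            try (FileSystem fs = FileSystem.get(conf)) {
                Path backup = new Path("/hbase/MasterProcWALs.bak");   // hypothetical backup location
                fs.mkdirs(backup);
                // Move only the master procedure WALs; leave /hbase/WALs alone.
                for (FileStatus f : fs.listStatus(new Path("/hbase/MasterProcWALs"))) {
                    fs.rename(f.getPath(), new Path(backup, f.getPath().getName()));
                }
            }
        }
    }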
