HBase bulk delete as "complete bulk load"

Problem description

I would like to delete 300 million rows in an HBase table. I could use the HBase API and send batches of Delete objects, but I am afraid that would take a lot of time.

This was the case with previous code where I wanted to insert millions of rows. Instead of using the HBase API and sending batches of Puts, I used a MapReduce job that emits RowKey / Put pairs and uses HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator) to set up my Reducer, so that it writes output ready to be fast-loaded by LoadIncrementalHFiles (complete bulk load). It was much, much quicker (5 minutes instead of 3 hours).
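For reference, a minimal sketch of such a bulk-load job driver, assuming the HBase 1.x client API; the table name my_table, the input/output paths, the cf:q column and the line format handled by the hypothetical mapper are all illustrative:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPutsDriver {

    /** Hypothetical mapper: each input line is "rowkey<TAB>value", written to cf:q. */
    public static class PutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
            context.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "bulk-load-puts");
        job.setJarByClass(BulkLoadPutsDriver.class);
        job.setMapperClass(PutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile output directory

        TableName name = TableName.valueOf("my_table");
        boolean ok;
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(name);
             RegionLocator locator = conn.getRegionLocator(name)) {
            // Installs PutSortReducer, the total-order partitioner and HFileOutputFormat2
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            ok = job.waitForCompletion(true);
        }
        System.exit(ok ? 0 : 1);
    }
}

The generated HFiles are then handed to the complete bulk load, e.g. hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hfile-dir> my_table.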

So I wanted to do the same for bulk deletes.

However, it seems that I cannot use this technique with Delete: HFileOutputFormat2 tries to configure the Reducer for KeyValue or Put (PutSortReducer), but nothing exists for Delete.

My first question is: why is there no "DeleteSortReducer" to enable the complete bulk load technique for Delete? Is it just something missing that has not been done yet, or is there a deeper reason that justifies it?

Second question, which is related: if I copy/paste the code of PutSortReducer, adapt it for Delete, and pass it as my job's Reducer, is it going to work? Is HBase complete bulk load going to produce HFiles full of tombstones?

Example:

import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.KeyValueUtil;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.StringUtils;

public class DeleteSortReducer extends
        Reducer<ImmutableBytesWritable, Delete, ImmutableBytesWritable, KeyValue> {

    @Override
    protected void reduce(
            ImmutableBytesWritable row,
            java.lang.Iterable<Delete> deletes,
            Reducer<ImmutableBytesWritable, Delete,
                    ImmutableBytesWritable, KeyValue>.Context context)
            throws java.io.IOException, InterruptedException
    {
        // although reduce() is called per-row, handle pathological case
        long threshold = context.getConfiguration().getLong(
                "putsortreducer.row.threshold", 1L * (1<<30));
        Iterator<Delete> iter = deletes.iterator();
        while (iter.hasNext()) {
            TreeSet<KeyValue> map = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
            long curSize = 0;
            // stop at the end or the RAM threshold
            while (iter.hasNext() && curSize < threshold) {
                Delete d = iter.next();
                for (List<Cell> cells: d.getFamilyCellMap().values()) {
                    for (Cell cell: cells) {
                        KeyValue kv = KeyValueUtil.ensureKeyValue(cell);
                        map.add(kv);
                        curSize += kv.heapSize();
                    }
                }
            }
            context.setStatus("Read " + map.size() + " entries of " + map.getClass()
                    + "(" + StringUtils.humanReadableInt(curSize) + ")");
            int index = 0;
            for (KeyValue kv : map) {
                context.write(row, kv);
                if (++index % 100 == 0)
                    context.setStatus("Wrote " + index);
            }

            // if we have more entries to process
            if (iter.hasNext()) {
                // force flush because we cannot guarantee intra-row sorted order
                context.write(null, null);
            }
        }
    }
}

Solution

First of all, a few words about how the delete operation works in HBase. On a delete command, HBase marks the data as deleted and writes information about this to the HFile. The data is not actually deleted from disk, and two records are present in the storage: the data and the deletion mark. Only after compaction is the data removed from disk storage.
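As an illustration (a minimal sketch assuming the HBase 1.x client API and an illustrative table my_table), a raw scan returns both the remaining data cells and the delete markers until a major compaction runs:

import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TombstoneDemo {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("my_table"))) {

            // Writes a delete marker; the previously stored data cells stay on disk.
            table.delete(new Delete(Bytes.toBytes("row1")));

            // A raw scan returns delete markers and not-yet-compacted deleted cells.
            Scan scan = new Scan();
            scan.setRaw(true);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    for (Cell cell : result.rawCells()) {
                        // isDelete() is true for tombstones, false for ordinary data cells.
                        System.out.println(cell + " delete=" + CellUtil.isDelete(cell));
                    }
                }
            }
        }
    }
}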

All this information is represented as KeyValue. A KeyValue that represents data has KeyValue.Type equal to Put. For a deletion mark, KeyValue.Type is set to one of the following values: Delete, DeleteColumn, DeleteFamily or DeleteFamilyVersion.

In your case, you can achieve a bulk deletion by creating KeyValues with the appropriate special value for KeyValue.Type. For example, if you want to delete only one column, you should create a KeyValue using the constructor

    KeyValue(byte[] row, byte[] family, byte[] qualifier,
             long timestamp, KeyValue.Type type)

    // example
    KeyValue kv = new KeyValue(row, family, qualifier,
                               time, KeyValue.Type.DeleteColumn);

The answer to the first question is that you don't need a special DeleteSortReducer; you should configure the reducer for KeyValue. The answer to the second question is no.
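A minimal sketch of that approach (the mapper, its one-row-key-per-line input format and the cf:q column are illustrative assumptions): the mapper emits delete-type KeyValues, and because the map output value class is KeyValue, HFileOutputFormat2.configureIncrementalLoad() installs the stock KeyValueSortReducer, so no DeleteSortReducer is required.

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DeleteMarkerMapper
        extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

    private static final byte[] FAMILY = Bytes.toBytes("cf");   // illustrative family
    private static final byte[] QUALIFIER = Bytes.toBytes("q"); // illustrative qualifier

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        byte[] row = Bytes.toBytes(line.toString().trim());
        long now = System.currentTimeMillis();

        // Same constructor as above, but with a delete type instead of Put:
        // this cell is a tombstone for cf:q on that row.
        KeyValue tombstone = new KeyValue(row, FAMILY, QUALIFIER, now, KeyValue.Type.DeleteColumn);
        context.write(new ImmutableBytesWritable(row), tombstone);
    }
}

The driver is then set up like the bulk-load job described in the question, except that the map output value class is KeyValue.class; the resulting HFiles contain only delete markers and are loaded with LoadIncrementalHFiles as before.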
