Solutions to put different values for a row-key but the same timestamp in HBase?



I'm new to HBase. I'm facing a problem when bulk loading data from a text file into HBase. Assume I have the following table:

    Key_id | f1:c1 | f2:c2
    row1     'a'     'b'
    row1     'x'     'y'


1. When I parse the 2 records and put them into HBase at the same time (with the same timestamp), only the version {row1 'x' 'y'} is kept. Here is the explanation:

When you put data into HBase, a timestamp is required. The timestamp can be generated automatically by the RegionServer or can be supplied by you. The timestamp must be unique per version of a given cell, because the timestamp identifies the version. To modify a previous version of a cell, for instance, you would issue a Put with a different value for the data itself, but the same timestamp.
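The explanation above can be checked without a cluster. The following is a minimal model of HBase's cell addressing (a plain map keyed by row/column/timestamp, not the real client): two puts to the same cell with the same timestamp leave only one surviving value, while a different timestamp creates a second version.

```java
import java.util.HashMap;
import java.util.Map;

public class CellVersionModel {
    // A cell version is identified by (row, column, timestamp); the value is the payload.
    static final Map<String, String> store = new HashMap<>();

    static void put(String row, String column, long ts, String value) {
        // Same (row, column, timestamp) key: the later put silently replaces the earlier one.
        store.put(row + "/" + column + "/" + ts, value);
    }

    public static void main(String[] args) {
        // Two puts with the SAME timestamp: only the last value survives.
        put("row1", "f1:c1", 1000L, "a");
        put("row1", "f1:c1", 1000L, "x");
        System.out.println(store.size());                 // 1
        System.out.println(store.get("row1/f1:c1/1000")); // x

        // A distinct timestamp creates a second version of the same cell.
        put("row1", "f1:c1", 1001L, "b");
        System.out.println(store.size());                 // 2
    }
}
```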

I'm thinking about specifying the timestamps myself, but I don't know how to set timestamps automatically during bulk loading, and does it affect loading performance? I need the fastest and safest import process for big data.

2. I tried to parse and put each record into the table one at a time, but the speed is very slow... So another question is: how many records (or how much data) should go in a batch before putting into HBase? (I wrote a simple Java program to do the puts. It's much slower than importing with the ImportTsv tool from the command line, and I don't know what batch size that tool uses.)

Many thanks for your advice!

Solution

Q1: HBase maintains versions using timestamps. If you don't provide one, it will take the default supplied by the HBase system (the RegionServer's current time).

In the put request you can also set a custom timestamp if you have such a requirement. It does not affect performance.
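If you do decide to supply timestamps yourself during a bulk load, one sketch (my suggestion, not part of the original answer) is a monotonic allocator: start from the wall clock, but bump by one whenever several records are stamped within the same millisecond, so every version gets a unique timestamp.

```java
import java.util.concurrent.atomic.AtomicLong;

public class TimestampAllocator {
    private final AtomicLong last = new AtomicLong(0);

    // Returns System.currentTimeMillis(), bumped to stay strictly increasing
    // when several records are stamped within the same millisecond.
    public long next() {
        long now = System.currentTimeMillis();
        return last.updateAndGet(prev -> Math.max(prev + 1, now));
    }

    public static void main(String[] args) {
        TimestampAllocator alloc = new TimestampAllocator();
        long t1 = alloc.next();
        long t2 = alloc.next();
        System.out.println(t2 > t1); // true, even within the same millisecond
    }
}
```

Each generated value can then be passed to `Put.addColumn(family, qualifier, ts, value)` so no two versions of a cell collide.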

Q2: You can do it in 2 ways:

    • A simple Java client with the batching technique shown below.

    • MapReduce ImportTsv (a batch client).

Ex: #1 Simple Java client with the batching technique.

I used HBase puts in batches of 100,000 records (a List of Put objects) while parsing JSON (similar to your standalone CSV client).

Below is the code snippet through which I achieved this. The same thing can be done while parsing other formats as well.

You probably need to call this method in 2 places:

1) with each full batch of 100,000 records;

2) for processing the remainder of your records, when fewer than 100,000 are left.

    public void addRecord(final ArrayList<Put> puts, final String tableName) throws Exception {
        HTable table = null;
        try {
            table = new HTable(HBaseConnection.getHBaseConfiguration(), getTable(tableName));
            table.put(puts); // one batched request for the whole list
            LOG.info("INSERT record[s] " + puts.size() + " to table " + tableName + " OK.");
        } catch (final Throwable e) {
            e.printStackTrace();
        } finally {
            LOG.info("Processed ---> " + puts.size());
            if (table != null) {
                table.close(); // flushes any buffered puts and releases resources
            }
            puts.clear();
        }
    }
    
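The two call sites described above (full batches, then the remainder) amount to a driver loop like the one below. To keep it runnable without a cluster, the hypothetical `flush` method just counts records; in real code it would delegate to the `addRecord` method above.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDriver {
    static final int BATCH_SIZE = 100_000;
    static int flushed = 0;     // stands in for rows actually written
    static int flushCalls = 0;  // how many round trips were made

    // In the real client this would call addRecord(puts, tableName).
    static void flush(List<String> batch) {
        flushed += batch.size();
        flushCalls++;
        batch.clear();
    }

    static void load(int totalRecords) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < totalRecords; i++) {
            batch.add("record-" + i);
            if (batch.size() == BATCH_SIZE) {
                flush(batch); // call site 1: a full batch of 100,000
            }
        }
        if (!batch.isEmpty()) {
            flush(batch);     // call site 2: the remainder (< 100,000)
        }
    }

    public static void main(String[] args) {
        load(250_000);
        System.out.println(flushed);    // 250000
        System.out.println(flushCalls); // 3 (two full batches + one remainder)
    }
}
```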

Note: the batch size is controlled internally by hbase.client.write.buffer, set like this in one of your config XMLs:

    <property>
        <name>hbase.client.write.buffer</name>
        <value>20971520</value> <!-- 20971520 bytes = 20 MB -->
    </property>

The default value is 2 MB; the example above raises it to 20 MB. Once the buffer is filled, all buffered puts are flushed to actually insert them into your table.

Furthermore, whether you use the MapReduce client or a standalone client with the batching technique, batching is controlled by the buffer property above.
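The buffer behavior can be sketched as a byte-counting model (illustrative only, not the actual client implementation): puts accumulate until their estimated size reaches the configured limit, then everything is sent in one flush.

```java
import java.util.ArrayList;
import java.util.List;

public class WriteBufferModel {
    static final long WRITE_BUFFER_BYTES = 20_971_520L; // hbase.client.write.buffer (20 MB here)

    final List<byte[]> pending = new ArrayList<>();
    long pendingBytes = 0;
    int flushes = 0;

    void put(byte[] cell) {
        pending.add(cell);
        pendingBytes += cell.length;
        if (pendingBytes >= WRITE_BUFFER_BYTES) {
            flush(); // automatic flush once the buffer limit is reached
        }
    }

    void flush() {
        // One round trip carrying every buffered put.
        flushes++;
        pending.clear();
        pendingBytes = 0;
    }

    public static void main(String[] args) {
        WriteBufferModel buf = new WriteBufferModel();
        byte[] oneKb = new byte[1024];
        for (int i = 0; i < 30_000; i++) { // ~30 MB of 1 KB puts
            buf.put(oneKb);
        }
        System.out.println(buf.flushes); // 1 (automatic flush at the 20 MB mark)
    }
}
```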

