SingleColumnValueFilter not returning proper number of rows


Problem description

In our HBase table, each row has a column called crawl identifier. Using a MapReduce job, we want to process rows from only one given crawl at a time. To run the job more efficiently, we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.

I wrote a test mapper that simply counts the number of rows with the correct crawl identifier, without any filter. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we ran that same job with the filter added to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or between these two jobs.

Since adding the scan filter caused the number of visible rows to change so dramatically, we suspect that we simply built the filter incorrectly.

Our MapReduce job features a single mapper:

public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {

    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1L);
        }
    }
}

The filter setup is like this:

String crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)) {
    throw new IllegalArgumentException("Crawl Identifier not set.");
}

// build an HBase scanner
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
    HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
    CompareOp.EQUAL,
    Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true); // also drop rows that lack the column entirely
scan.setFilter(filter);
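
For context, a Scan like this is typically attached to the mapper via TableMapReduceUtil; here is a minimal sketch of that wiring, in which the table name "crawl_table" and the job variable are assumptions rather than details from the question:

// Hypothetical wiring sketch; "crawl_table" and job are assumed names.
TableMapReduceUtil.initTableMapperJob(
    "crawl_table",                 // source table (hypothetical name)
    scan,                          // the Scan carrying the filter built above
    RowCountMapper.class,          // the mapper shown earlier
    ImmutableBytesWritable.class,  // mapper output key class
    Put.class,                     // mapper output value class
    job);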

Are we using the wrong filter, or have we configured it wrong?

EDIT: we're looking at manually adding all the column families as per https://issues.apache.org/jira/browse/HBASE-2198, but I'm pretty sure the Scan includes all the families by default.
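
A minimal sketch of that workaround, assuming the same HBaseSchema helper used above (only the identifier's own family is shown):

// Per HBASE-2198: explicitly request the family that holds the crawl
// identifier so the filter can always evaluate the column.
scan.addFamily(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily());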

Solution

The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter uses Bytes.toBytes(String), which always encodes as UTF-8 [1], whereas the value may have been written using the platform's native character encoding, either inside HBaseSchema or wherever the record is written, if String.getBytes() was used [2]. Check that the crawlIdentifier was originally written to HBase with the following, so that the filter compares like for like in the filtered scan.

Bytes.toBytes(crawlIdentifier)
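
To make the failure mode concrete, here is a small standalone sketch; the identifier value is made up, and the non-ASCII character is what exposes the mismatch on a JVM whose default charset is not UTF-8:

import java.util.Arrays;

import org.apache.hadoop.hbase.util.Bytes;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        String crawlIdentifier = "crawl-2013-ü"; // hypothetical value

        byte[] filterBytes = Bytes.toBytes(crawlIdentifier); // always UTF-8
        byte[] storedBytes = crawlIdentifier.getBytes();     // JVM default charset

        // On a JVM whose default charset is not UTF-8 (e.g. ISO-8859-1),
        // these arrays differ, so an EQUAL filter built from filterBytes
        // never matches a cell that was written with storedBytes.
        System.out.println("representations match: "
                + Arrays.equals(filterBytes, storedBytes));
    }
}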

[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
