How can I skip HBase rows that are missing specific columns?


Question


I'm writing a MapReduce job over HBase using TableMapper. I want to skip rows that don't have specific columns. For example, if the mapper reads from the "meta" family, "source" qualifier column, the mapper should be able to expect something in that column. I know I can add columns to the Scan object, but I expect this merely limits which rows can be seen by the scan, not which columns need to be there.

What filter can I use to skip rows without the columns I need?

Also, the filter concept itself is a little strange. Does the filter operate on a row-by-row basis or a keyvalue-by-keyvalue basis? Does "filter a row" mean skip the row or include it, or simply put it through a filter?

Is this explained anywhere more clearly than in the HBase Javadocs?

Answer

The HBase book is the best place to look for answers to many questions like these. In particular, http://hbase.apache.org/book/client.filter.html explains how filters work.

Filters run on the server side, so they are efficient and reduce the amount of data sent over the network. I agree that the Javadocs make the include-versus-exclude semantics non-obvious, but I think the book makes it clear: a filter defines what must be true for a row to be returned to the client.
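To make the "what must be true for the row to be returned" rule concrete, here is a minimal sketch in plain Java. It is only a toy model: rows are ordinary maps rather than HBase Result objects, and the class name, method names, and sample data are all hypothetical. In the real client API, the closest counterpart I know of is SingleColumnValueFilter with setFilterIfMissing(true), which skips the entire row when the tested column is absent. The answer above does not name a specific filter, so treat that as a pointer to check against the filter chapter of the book.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: rows are plain maps, not HBase Result objects.
final class MissingColumnModel {

    // Returns the keys of the rows that contain the required "family:qualifier"
    // column. A row that lacks the column is skipped as a whole, so none of its
    // other cells reach the caller either.
    static List<String> rowsWithColumn(Map<String, Map<String, String>> table,
                                       String column) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            if (row.getValue().containsKey(column)) {
                kept.add(row.getKey());
            }
        }
        return kept;
    }

    // Hypothetical sample data: row1 has meta:source, row2 does not.
    static Map<String, Map<String, String>> sampleTable() {
        Map<String, Map<String, String>> table = new LinkedHashMap<>();

        Map<String, String> row1 = new LinkedHashMap<>();
        row1.put("meta:source", "crawler");
        row1.put("meta:lang", "en");
        table.put("row1", row1);

        Map<String, String> row2 = new LinkedHashMap<>();
        row2.put("meta:lang", "de");
        table.put("row2", row2);

        return table;
    }

    public static void main(String[] args) {
        // Prints the keys of the rows that would be handed to the mapper.
        System.out.println(rowsWithColumn(sampleTable(), "meta:source"));
    }
}
```

The point of the model is that the decision is made per row: once the required column is found to be missing, the whole row is dropped, even though it still has other cells in the same family.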

Scans are also a good way to define what must be returned, but you need to be careful in how you set them up. If you select a whole column family in one API call and later in your code select a specific column qualifier from that same family, the second call overrides the first: only that qualifier is returned, and no other qualifier in the family is.
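The override behavior described above can be sketched with a small plain-Java model. This is not HBase's source code. The class ScanSelectionModel and its returns method are hypothetical, and only the method names addFamily and addColumn are borrowed from the real Scan class. The model assumes the selection is kept as a map from family to a set of qualifiers, where a null set means "the whole family".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of how a scan's column selection behaves; not the HBase Scan class.
final class ScanSelectionModel {

    // family -> selected qualifiers; a null value means "whole family".
    private final Map<String, Set<String>> selection = new HashMap<>();

    // Selects the whole family, replacing any qualifiers chosen earlier for it.
    ScanSelectionModel addFamily(String family) {
        selection.put(family, null);
        return this;
    }

    // Selects one qualifier. If the family was previously selected as a whole,
    // that broader selection is replaced by a set holding just this qualifier.
    ScanSelectionModel addColumn(String family, String qualifier) {
        Set<String> qualifiers = selection.get(family);
        if (qualifiers == null) {
            qualifiers = new TreeSet<>();
            selection.put(family, qualifiers);
        }
        qualifiers.add(qualifier);
        return this;
    }

    // True if a cell in family:qualifier would be returned under this selection.
    boolean returns(String family, String qualifier) {
        if (selection.isEmpty()) {
            return true;                    // nothing selected: everything is returned
        }
        if (!selection.containsKey(family)) {
            return false;                   // family was never selected
        }
        Set<String> qualifiers = selection.get(family);
        return qualifiers == null || qualifiers.contains(qualifier);
    }

    public static void main(String[] args) {
        ScanSelectionModel scan = new ScanSelectionModel()
                .addFamily("meta")
                .addColumn("meta", "source");
        System.out.println(scan.returns("meta", "source"));
        System.out.println(scan.returns("meta", "lang"));
    }
}
```

In this model, calling addFamily("meta") and then addColumn("meta", "source") leaves only meta:source selected, which mirrors the pitfall described above. If the mapper needs several columns from the family, each one has to be added explicitly.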
