HBase Scan Performance

Problem description

I am performing a range scan that returns 500k records. If I set scan.setCaching(100000), it takes less than one second, but if scan.setCaching(100000) is not set, it takes nearly 38 seconds.

If I set scan.setBlockCache(false) and scan.setCaching(100000), what will happen? Will the rows be cached or not?

I am dropping the OS cache after the first scan, but the time to scan the records does not change. Why?

How, then, can I check the read performance?

Solution

Scan.setCaching is a misnomer. It should really be called something like Scan.setPrefetch. setCaching actually specifies how many rows the regionserver returns to the client per RPC. If you use setCaching(1), then every call to next() costs a round trip to the regionserver. The downside of setting it to a larger number is that you pay for extra memory in the client, and you may fetch rows that you never use, for example if you stop scanning after reaching a certain number of rows or after you have found a specific value.
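To make the round-trip cost concrete, here is a minimal sketch that only counts RPCs for the 500k-row scan from the question. It is a toy model, not the HBase client: the class ScanRpcModel, its methods, and the 0.1 ms per-RPC latency are assumptions made up for illustration, and a real scanner issues a few extra calls (opening and closing the scanner, detecting the end of the range) that are ignored here.

```java
// Toy model only: counts round trips, does not talk to HBase.
final class ScanRpcModel {

    // Number of fetch round trips needed to stream totalRows rows when
    // each RPC returns at most rowsPerRpc rows (ceiling division).
    static long rpcCount(long totalRows, int rowsPerRpc) {
        if (totalRows < 0 || rowsPerRpc <= 0) {
            throw new IllegalArgumentException(
                "totalRows must be >= 0 and rowsPerRpc must be > 0");
        }
        return (totalRows + rowsPerRpc - 1) / rowsPerRpc;
    }

    // Time spent purely on round trips, in milliseconds, for a given
    // per-RPC latency. The latency value is an input, not a measurement.
    static double roundTripMillis(long totalRows, int rowsPerRpc,
                                  double latencyMillis) {
        return rpcCount(totalRows, rowsPerRpc) * latencyMillis;
    }

    public static void main(String[] args) {
        long rows = 500_000;         // row count taken from the question
        double latencyMillis = 0.1;  // hypothetical per-RPC latency
        for (int caching : new int[] {1, 100, 100_000}) {
            System.out.printf("caching=%d -> %d RPCs, ~%.1f ms in round trips%n",
                caching,
                rpcCount(rows, caching),
                roundTripMillis(rows, caching, latencyMillis));
        }
    }
}
```

With one row per RPC the model needs 500,000 round trips, while 100,000 rows per RPC needs only 5, which is the kind of gap that could plausibly account for the difference the question reports.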

Scan.setBlockCache means something entirely different, as Chandra pointed out. (In the HBase client API the method is named Scan.setCacheBlocks.) Passing false instructs the regionserver not to put the blocks read by this Scan into the HBase BlockCache, which is a pool of memory separate from the MemStore. Note that MemStores are used for writing and the BlockCache is used for reading, and these two pieces of memory are completely separate. HBase currently does not use the BlockCache as a write-back cache. You can control the size of the block cache with the hfile.block.cache.size config setting in hbase-site.xml. Similarly, you can control the total pool size of the MemStores via the hbase.regionserver.global.memstore.size setting.

You might want to use setBlockCache(false) if you are doing a full table scan and you don't want to evict your current working set from the block cache. Otherwise, if you are scanning data that is used frequently, it is probably better to leave setBlockCache at its default.
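The eviction effect can be sketched with a toy cache. This is a plain LRU built on java.util.LinkedHashMap, not HBase's actual BlockCache, whose eviction policy is more elaborate; the class ToyBlockCache, its capacity of 4 blocks, and the block names are all made up for illustration. The cacheBlocks flag stands in for the true/false setting discussed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy LRU cache for illustration only; not HBase's BlockCache.
final class ToyBlockCache {
    private final LinkedHashMap<String, byte[]> blocks;

    ToyBlockCache(final int capacity) {
        // accessOrder=true keeps the least recently used entry first.
        this.blocks = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > capacity; // evict once over capacity
            }
        };
    }

    // Simulate reading one block. On a miss the block is "loaded from disk"
    // and inserted into the cache only when cacheBlocks is true.
    void read(String blockId, boolean cacheBlocks) {
        if (blocks.get(blockId) != null) {
            return; // cache hit; get() also refreshes recency
        }
        if (cacheBlocks) {
            blocks.put(blockId, new byte[0]);
        }
    }

    boolean contains(String blockId) {
        return blocks.containsKey(blockId); // does not change recency
    }

    // Warm 4 hot blocks, run a 100-block scan, count hot blocks still cached.
    static int hotBlocksSurviving(boolean cacheBlocksDuringScan) {
        ToyBlockCache cache = new ToyBlockCache(4);
        for (int i = 0; i < 4; i++) {
            cache.read("hot-" + i, true);
        }
        for (int i = 0; i < 100; i++) {
            cache.read("scan-" + i, cacheBlocksDuringScan);
        }
        int surviving = 0;
        for (int i = 0; i < 4; i++) {
            if (cache.contains("hot-" + i)) {
                surviving++;
            }
        }
        return surviving;
    }

    public static void main(String[] args) {
        System.out.println("scan caches blocks: " + hotBlocksSurviving(true)
            + " hot blocks left");
        System.out.println("scan bypasses cache: " + hotBlocksSurviving(false)
            + " hot blocks left");
    }
}
```

In this model, a scan that caches its blocks pushes every hot block out, while a scan that bypasses the cache leaves the working set untouched.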
