Why is an exported HBase table 4 times bigger than the original?

Views: 110. Published: 2018/5/31 19:56:40. Tags: hadoop, hbase, hdfs


Question

I need to back up an HBase table before updating to a newer version. I decided to export the table to HDFS with the standard Export tool and then move it to the local file system. For some reason the exported table is about 4 times larger than the original:

    hdfs dfs -du -h
    1.4T backup-my-table

    hdfs dfs -du -h /hbase/data/default/
    417G my-table

What could be the reason? Is it somehow related to compression?

P.S. Maybe the way I made the backup matters. First I took a snapshot of the target table, then cloned it into a copy table, then deleted the unnecessary column families from that copy (so I expected the result to be about half the size), and then ran the Export tool on the copy.

Update for future visitors: here is the correct command to export a table with compression:

    ./hbase org.apache.hadoop.hbase.mapreduce.Export \
      -Dmapreduce.output.fileoutputformat.compress=true \
      -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
      -Dmapreduce.output.fileoutputformat.compress.type=BLOCK \
      -Dhbase.client.scanner.caching=200 \
      table-to-export export-dir

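Solution

The original table was probably created with SNAPPY or some other compression, so its files on disk are compressed while the exported files are not. A table with compression is created like this:

    create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }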

Compression support check

Use CompressionTest to verify that Snappy support is enabled and that the libraries can be loaded on ALL NODES of your cluster:

    $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy

Export command source to apply compression

If you dig into the Export command (source), you will find the properties below, which can reduce the output size drastically:

    mapreduce.output.fileoutputformat.compress=true
    mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
    mapreduce.output.fileoutputformat.compress.type=BLOCK

They are listed in the usage text of the tool:

    /*
     * @param errorMsg Error message. Can be null.
     */
    private static void usage(final String errorMsg) {
      if (errorMsg != null && errorMsg.length() > 0) {
        System.err.println("ERROR: " + errorMsg);
      }
      System.err.println("Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> "
          + "[<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]\n");
      System.err.println(" Note: -D properties will be applied to the conf used. ");
      System.err.println(" For example: ");
      System.err.println(" -D mapreduce.output.fileoutputformat.compress=true");
      System.err.println(" -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec");
      System.err.println(" -D mapreduce.output.fileoutputformat.compress.type=BLOCK");
      System.err.println(" Additionally, the following SCAN properties can be specified");
      System.err.println(" to control/limit what is exported..");
      System.err.println(" -D " + TableInputFormat.SCAN_COLUMN_FAMILY + "=<familyName>");
      System.err.println(" -D " + RAW_SCAN + "=true");
      System.err.println(" -D " + TableInputFormat.SCAN_ROW_START + "=<ROWSTART>");
      System.err.println(" -D " + TableInputFormat.SCAN_ROW_STOP + "=<ROWSTOP>");
      System.err.println(" -D " + JOB_NAME_CONF_KEY
          + "=jobName - use the specified mapreduce job name for the export");
      System.err.println("For performance consider the following properties:\n"
          + " -Dhbase.client.scanner.caching=100\n"
          + " -Dmapreduce.map.speculative=false\n"
          + " -Dmapreduce.reduce.speculative=false");
      System.err.println("For tables with very wide rows consider setting the batch size as below:\n"
          + " -D" + EXPORT_BATCHING + "=10");
    }
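For scripted backups, the compressed export command can be assembled in code instead of typed by hand. The following is only a minimal sketch: the class `ExportCommand` and its `build` method are hypothetical names introduced for illustration, and it assumes the `hbase` launcher script is on the PATH. It uses only the three compression properties quoted above and places the `-D` options before the table name, as the usage text shows.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: assembles the argument list for a compressed Export run. */
public class ExportCommand {

    /**
     * Builds the command line for org.apache.hadoop.hbase.mapreduce.Export.
     * The -D options are placed before the table name and output directory,
     * matching the order shown in the tool's usage text.
     */
    public static List<String> build(String table, String outputDir, String codecClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add("hbase"); // assumption: the hbase launcher script is on the PATH
        cmd.add("org.apache.hadoop.hbase.mapreduce.Export");
        cmd.add("-Dmapreduce.output.fileoutputformat.compress=true");
        cmd.add("-Dmapreduce.output.fileoutputformat.compress.codec=" + codecClass);
        cmd.add("-Dmapreduce.output.fileoutputformat.compress.type=BLOCK");
        cmd.add(table);
        cmd.add(outputDir);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("table-to-export", "export-dir",
                "org.apache.hadoop.io.compress.GzipCodec");
        // Print the command instead of running it, so it can be reviewed first.
        System.out.println(String.join(" ", cmd));
    }
}
```

To actually launch the job, the returned list could be handed to `new ProcessBuilder(cmd).inheritIO().start()`; printing it first keeps the sketch free of side effects.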

Also see getExportFilter, which might be useful in your case to narrow your export.

    private static Filter getExportFilter(String[] args) {
      Filter exportFilter = null;
      String filterCriteria = (args.length > 5) ? args[5] : null;
      if (filterCriteria == null) return null;
      if (filterCriteria.startsWith("^")) {
        String regexPattern = filterCriteria.substring(1, filterCriteria.length());
        exportFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator(regexPattern));
      } else {
        exportFilter = new PrefixFilter(Bytes.toBytesBinary(filterCriteria));
      }
      return exportFilter;
    }
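To see how the sixth argument is interpreted, the dispatch in getExportFilter can be imitated with plain Java. This is a rough stand-in, not HBase code: `FilterArgSketch` and `keeps` are hypothetical names, row keys are treated as plain strings, the `\x..` escapes that `Bytes.toBytesBinary` understands are ignored, and the regex branch assumes unanchored DOTALL matching, which is how I understand `RegexStringComparator` to behave.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in that mimics how the Export filter argument is dispatched. */
public class FilterArgSketch {

    /** Returns true if rowKey would be kept for the given filter argument. */
    public static boolean keeps(String filterCriteria, String rowKey) {
        if (filterCriteria == null) {
            return true; // no filter argument: every row is exported
        }
        if (filterCriteria.startsWith("^")) {
            // A leading '^' selects the regex branch; the '^' itself is stripped.
            String regexPattern = filterCriteria.substring(1);
            // Assumption: unanchored, DOTALL matching on the row key.
            return Pattern.compile(regexPattern, Pattern.DOTALL).matcher(rowKey).find();
        }
        // Anything else is treated as a row-key prefix.
        return rowKey.startsWith(filterCriteria);
    }

    public static void main(String[] args) {
        System.out.println(keeps("^user-\\d+", "user-42")); // regex branch
        System.out.println(keeps("user-", "user-42"));      // prefix branch
        System.out.println(keeps("user-", "admin-7"));      // prefix branch, no match
    }
}
```

One consequence worth noting under these assumptions: because the leading `^` is consumed as a marker, a pattern that should match only at the start of the row key needs a second `^`, for example `^^user-`.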
