Why is an exported HBase table 4 times bigger than the original?
I need to back up an HBase table before upgrading to a newer version. I decided to export the table to HDFS with the standard Export tool and then move it to the local file system. For some reason the exported table is 4 times larger than the original one:
hdfs dfs -du -h
1.4T backup-my-table
hdfs dfs -du -h /hbase/data/default/
417G my-table
What can be the reason? Is it somehow related to compression?
P.S. Maybe the way I made the backup matters. First I took a snapshot of the target table, then cloned it to a copy table, then deleted the unnecessary column families from this copy (so I expected the result to be about 2 times smaller), and then I ran the Export tool on the copy.
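As a quick sanity check on the "4 times" figure, the ratio can be computed from the two sizes. This is only a sketch: `size_ratio` is a hypothetical helper name, and the commented-out variant assumes that the first column printed by `hdfs dfs -du -s` is the size in bytes.

```shell
# Ratio of two sizes expressed in the same unit, printed with one decimal place.
size_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Figures from the listing above, both in G: 1.4T = 1.4 * 1024 = 1433.6G.
size_ratio 1433.6 417

# Against a live cluster (assumption: first column of `-du -s` is bytes):
# size_ratio "$(hdfs dfs -du -s backup-my-table | awk '{print $1}')" \
#            "$(hdfs dfs -du -s /hbase/data/default/my-table | awk '{print $1}')"
```

The export is therefore closer to 3.4 times the size of the stored table, which is in the range one might expect when compressed data is written back out uncompressed.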
Update for future visitors: here is the correct command to export a table with compression:
./hbase org.apache.hadoop.hbase.mapreduce.Export \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
-Dmapreduce.output.fileoutputformat.compress.type=BLOCK \
-Dhbase.client.scanner.caching=200 \
table-to-export export-dir
Maybe the original table is compressed with SNAPPY or some other compression codec, set up like this:
create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
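Compression support check
Use CompressionTest to verify that snappy support is enabled and that the libs can be loaded on all nodes of your cluster: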
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy
Export command source, to apply compression:
If you dig into the source of the Export command, you will find the properties below, which can reduce the output size drastically:
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
mapreduce.output.fileoutputformat.compress.type=BLOCK
/*
 * @param errorMsg Error message. Can be null.
 */
private static void usage(final String errorMsg) {
  if (errorMsg != null && errorMsg.length() > 0) {
    System.err.println("ERROR: " + errorMsg);
  }
  System.err.println("Usage: Export [-D <property=value>]* <tablename> <outputdir> [<versions> " +
      "[<starttime> [<endtime>]] [^[regex pattern] or [Prefix] to filter]]\n");
  System.err.println(" Note: -D properties will be applied to the conf used. ");
  System.err.println(" For example: ");
  System.err.println(" -D mapreduce.output.fileoutputformat.compress=true");
  System.err.println(" -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec");
  System.err.println(" -D mapreduce.output.fileoutputformat.compress.type=BLOCK");
  System.err.println(" Additionally, the following SCAN properties can be specified");
  System.err.println(" to control/limit what is exported..");
  System.err.println(" -D " + TableInputFormat.SCAN_COLUMN_FAMILY + "=<familyName>");
  System.err.println(" -D " + RAW_SCAN + "=true");
  System.err.println(" -D " + TableInputFormat.SCAN_ROW_START + "=<ROWSTART>");
  System.err.println(" -D " + TableInputFormat.SCAN_ROW_STOP + "=<ROWSTOP>");
  System.err.println(" -D " + JOB_NAME_CONF_KEY
      + "=jobName - use the specified mapreduce job name for the export");
  System.err.println("For performance consider the following properties:\n"
      + " -Dhbase.client.scanner.caching=100\n"
      + " -Dmapreduce.map.speculative=false\n"
      + " -Dmapreduce.reduce.speculative=false");
  System.err.println("For tables with very wide rows consider setting the batch size as below:\n"
      + " -D" + EXPORT_BATCHING + "=10");
}
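The usage text above suggests that the column family can be restricted at export time, which might avoid cloning the table and deleting families first. The following is a dry-run sketch that only prints the command line. `export_cmd_compressed` and the family name `cf1` are hypothetical, and the property string `hbase.mapreduce.scan.column.family` is what `TableInputFormat.SCAN_COLUMN_FAMILY` is assumed to resolve to; check it against your HBase version.

```shell
# Print (do not run) an Export command with gzip block compression,
# restricted to a single column family.
export_cmd_compressed() {
  table="$1"; outdir="$2"; family="$3"
  echo "hbase org.apache.hadoop.hbase.mapreduce.Export" \
    "-Dmapreduce.output.fileoutputformat.compress=true" \
    "-Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec" \
    "-Dmapreduce.output.fileoutputformat.compress.type=BLOCK" \
    "-Dhbase.mapreduce.scan.column.family=$family" \
    "$table" "$outdir"
}

# Hypothetical family name; table and directory names as in the command above.
export_cmd_compressed table-to-export export-dir cf1
```

Review the printed line, then paste it into a shell on a node with the HBase client installed.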
Also see getExportFilter, which might be useful in your case to narrow down your export:
private static Filter getExportFilter(String[] args) {
  Filter exportFilter = null;
  String filterCriteria = (args.length > 5) ? args[5]: null;
  if (filterCriteria == null) return null;
  if (filterCriteria.startsWith("^")) {
    String regexPattern = filterCriteria.substring(1, filterCriteria.length());
    exportFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator(regexPattern));
  } else {
    exportFilter = new PrefixFilter(Bytes.toBytesBinary(filterCriteria));
  }
  return exportFilter;
}
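Because the filter is read from the sixth positional argument (args[5]), the versions, start time and end time have to be supplied before it. Here is a dry-run sketch of how such a command line might be assembled. `export_cmd_with_filter`, the `user_` row-key prefix and the chosen versions and time bounds are illustrative assumptions, not values from the original post.

```shell
# Print (do not run) an Export command that filters rows by key.
# Positional layout per the usage text:
#   <tablename> <outputdir> <versions> <starttime> <endtime> <filter>
export_cmd_with_filter() {
  table="$1"; outdir="$2"; filter="$3"
  versions=1                      # assumption: latest version only
  starttime=0                     # assumption: no lower time bound
  endtime=9223372036854775807     # Long.MAX_VALUE, i.e. no upper time bound
  echo "hbase org.apache.hadoop.hbase.mapreduce.Export $table $outdir $versions $starttime $endtime $filter"
}

# No leading ^ : handled as a row-key prefix (PrefixFilter).
export_cmd_with_filter my-table export-dir 'user_'
# Leading ^ : the remainder is used as a regex (RowFilter + RegexStringComparator).
export_cmd_with_filter my-table export-dir '^user_[0-9]+$'
```

When running the printed command for real, keep the filter argument in single quotes so the shell does not interpret characters such as `^`, `[` or `$`.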