HyperTable: Loading data using Mutators vs. LOAD DATA INFILE

Problem description

I am starting a discussion which, I hope, will become the one place to compare loading data using mutators vs. loading from flat files via 'LOAD DATA INFILE'.

I have been baffled in my attempts to get any substantial performance out of mutators (using batch sizes of 1000, 10000, 100K, et cetera).

My project involves loading close to 400 million rows of social media data into HyperTable for real-time analytics. It took me close to 3 days to load just 1 million rows (code sample below). Each row is approximately 32 bytes. So, to avoid spending 2-3 weeks loading this much data, I prepared a flat file of rows and used the LOAD DATA INFILE method. The performance gain was amazing: with this method, the load rate was 368,336 cells/sec.

See below for an actual snapshot of the loads:

hypertable> LOAD DATA INFILE "/data/tmp/users.dat" INTO TABLE users;


Loading 7,113,154,337 bytes of input data...                    

0%   10   20   30   40   50   60   70   80   90   100%          
|----|----|----|----|----|----|----|----|----|----|             
***************************************************             
Load complete.                                                  

 Elapsed time:  508.07 s                                       
 Avg key size:  8.92 bytes                                     
  Total cells:  218976067                                      
   Throughput:  430998.80 cells/s                              
      Resends:  2210404                                        


hypertable> LOAD DATA INFILE "/data/tmp/graph.dat" INTO TABLE graph;

Loading 12,693,476,187 bytes of input data...                    

0%   10   20   30   40   50   60   70   80   90   100%           
|----|----|----|----|----|----|----|----|----|----|
***************************************************              
Load complete.                                                   

 Elapsed time:  1189.71 s                                       
 Avg key size:  17.48 bytes                                     
  Total cells:  437952134                                       
   Throughput:  368118.13 cells/s                               
      Resends:  1483209 

Why is the performance difference between the two methods so vast? What's the best way to improve mutator performance? Sample mutator code is below:

my $batch_size = 1000000; # 1000 or 10000 make no substantial difference
my $ignore_unknown_cfs = 2;
my $ht = new Hypertable::ThriftClient($master, $port);
my $ns = $ht->namespace_open($namespace);
my $users_mutator = $ht->mutator_open($ns, 'users', $ignore_unknown_cfs, 10);
my $graph_mutator = $ht->mutator_open($ns, 'graph', $ignore_unknown_cfs, 10);

# inside the per-row loop:
my $key  = new Hypertable::ThriftGen::Key({ row => $row, column_family => $cf, column_qualifier => $cq });
my $cell = new Hypertable::ThriftGen::Cell({ key => $key, value => $val });
$ht->mutator_set_cell($users_mutator, $cell);
$ht->mutator_flush($users_mutator);

I would appreciate any input on this. I don't have a tremendous amount of HyperTable experience.

Thanks.

Answer

If it's taking three days to load one million rows, then you're probably calling flush() after every row insert, which is not the right thing to do. Before I describe how to fix that: your mutator_open() arguments aren't quite right. You don't need to specify ignore_unknown_cfs, and you should supply 0 for the flush_interval, something like this:

my $users_mutator = $ht->mutator_open($ns, 'users', 0, 0);
my $graph_mutator = $ht->mutator_open($ns, 'graph', 0, 0);

You should only call mutator_flush() if you would like to checkpoint how much of the input data has been consumed. A successful call to mutator_flush() means that all data that has been inserted on that mutator has durably made it into the database. If you're not checkpointing how much of the input data has been consumed, then there is no need to call mutator_flush(), since it will get flushed automatically when you close the mutator.
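The checkpointing pattern described above can be sketched as follows. This is a minimal illustration, not the real client: MockClient is a hypothetical stand-in for Hypertable::ThriftClient (the method names mutator_set_cells and mutator_flush match the answer, but the bodies here only count calls), and the records are fake 32-byte rows tagged with their input-file offsets.

```perl
package MockClient;
sub new { bless { flushes => 0 }, shift }
sub mutator_set_cells { }                  # would send cells to the ThriftBroker
sub mutator_flush     { $_[0]{flushes}++ } # would make all sent cells durable
package main;
use strict;
use warnings;

my $CHECKPOINT_EVERY = 100;   # cells between durability checkpoints (tune up for real loads)
my $client  = MockClient->new;
my $mutator = 'users-mutator';

# Fake input: 250 records of ~32 bytes each, tagged with their file offset.
my @records = map { +{ offset => $_ * 32, cell => "cell$_" } } 1 .. 250;

my ($cells_sent, $safe_offset) = (0, 0);
for my $rec (@records) {
    $client->mutator_set_cells($mutator, [ $rec->{cell} ]);
    $cells_sent++;
    if ($cells_sent % $CHECKPOINT_EVERY == 0) {
        $client->mutator_flush($mutator);  # everything sent so far is now durable
        $safe_offset = $rec->{offset};     # a restarted load can resume from here
    }
}
printf "flushes: %d, resume offset: %d\n", $client->{flushes}, $safe_offset;
```

After the loop, only two flushes have happened (at cells 100 and 200), and $safe_offset records the last input position known to be durable; a crashed load could seek there and continue instead of starting over.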

The next performance problem with your code that I see is that you're using mutator_set_cell(). You should use either mutator_set_cells() or mutator_set_cells_as_arrays() since each method call is a round-trip to the ThriftBroker, which is expensive. By using the mutator_set_cells_* methods, you amortize that round-trip over many cells. The mutator_set_cells_as_arrays() method can be more efficient for languages where object construction overhead is large in comparison to native datatypes (e.g. string). I'm not sure about Perl, but you might want to give that a try to see if it boosts performance.
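The amortization point can be made concrete with a small sketch. MockClient below is a hypothetical stand-in for the Thrift client (only the mutator_set_cells name comes from the answer; the counting logic is for illustration), showing how buffering cells client-side and sending them in batches collapses thousands of round-trips into a handful.

```perl
package MockClient;
sub new { bless { calls => 0, cells => 0 }, shift }
sub mutator_set_cells {                     # one call = one round-trip to the broker
    my ($self, $mutator, $cells) = @_;
    $self->{calls}++;
    $self->{cells} += scalar @$cells;
}
package main;
use strict;
use warnings;

my $BATCH_SIZE = 10_000;
my $client  = MockClient->new;
my $mutator = 'users-mutator';   # handle returned by mutator_open() in real code
my @batch;

for my $i (1 .. 25_000) {
    push @batch, "cell$i";                           # stand-in for a ThriftGen::Cell
    if (@batch >= $BATCH_SIZE) {
        $client->mutator_set_cells($mutator, [@batch]);  # one round-trip per batch
        @batch = ();
    }
}
$client->mutator_set_cells($mutator, [@batch]) if @batch; # send the remainder

printf "round-trips: %d, cells: %d\n", $client->{calls}, $client->{cells};
```

25,000 cells go over the wire in 3 round-trips instead of 25,000; with per-cell mutator_set_cell(), the network latency of each round-trip dominates the load time.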

Also, be sure to call mutator_close() when you're finished with the mutator.
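A tiny sketch of why the close matters: with flush_interval 0 and no explicit mutator_flush(), cells sit in the client-side buffer until the mutator is closed. MockMutator here is a hypothetical model of that buffering behavior only, not the real API.

```perl
package MockMutator;
sub new       { bless { buffered => 0, durable => 0 }, shift }
sub set_cells { my ($self, $cells) = @_; $self->{buffered} += scalar @$cells }
sub close     {                            # closing implicitly flushes the buffer
    my $self = shift;
    $self->{durable} += $self->{buffered};
    $self->{buffered} = 0;
}
package main;
use strict;
use warnings;

my $m = MockMutator->new;
$m->set_cells([ "cell$_" ]) for 1 .. 5;

my $durable_before = $m->{durable};   # cells are still only buffered here
$m->close;                            # implicit flush on close
printf "durable before close: %d, after close: %d\n", $durable_before, $m->{durable};
```

Skipping the close would leave those last buffered cells unwritten, which shows up as mysteriously missing rows at the tail of a load.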
