20 Billion Rows/Month - Hbase / Hive / Greenplum / What?


Question

I'd like to use your wisdom for picking the right solution for a data-warehouse system. Here are some details to better understand the problem:


Data is organized in a star schema structure with one BIG fact table and ~15 dimensions.

  • 20B fact rows per month
  • 10 dimensions with hundreds of rows (somewhat hierarchical)
  • 5 dimensions with thousands of rows
  • 2 dimensions with ~200K rows
  • 2 big dimensions with 50M-100M rows


Two typical queries run against this DB:

Top members of dimq:

select    top X dimq, count(id) 
from      fact 
where     dim1 = x and dim2 = y and dim3 = z 
group by  dimq 
order by  count(id) desc


Measures against a tuple:

select    count(distinct dis1), count(distinct dis2), count(dim1), count(dim2),...
from      fact 
where     dim1 = x and dim2 = y and dim3 = z 


  1. What is the best platform to perform such queries?
  2. What kind of hardware is needed?
  3. Where can it be hosted (EC2?)





(please ignore importing and loading issues at the moment)


Tnx,
Haggai.

Answer


I cannot stress this enough: Get something that plays nicely with off-the-shelf reporting tools.


20 billion rows per month puts you in VLDB territory, so you need partitioning. The low-cardinality dimensions also suggest that bitmap indexes would be a performance win.
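For illustration, here is a minimal Oracle-style sketch of that layout (the table, column, and partition names are hypothetical, and a real fact table would carry all ~15 dimension keys): the fact table is range-partitioned by month, and the low-cardinality keys get local bitmap indexes.

-- Hypothetical Oracle DDL sketch: monthly range partitions on the fact
-- table plus local bitmap indexes on the low-cardinality dimension keys.
create table fact (
  date_key  date    not null,
  dim1      number  not null,
  dim2      number  not null,
  dim3      number  not null,
  dimq      number  not null,
  id        number  not null
  -- ... remaining dimension keys
)
partition by range (date_key) (
  partition p200901 values less than (date '2009-02-01'),
  partition p200902 values less than (date '2009-03-01')
  -- ... one partition per month
);

create bitmap index fact_dim1_bix on fact (dim1) local;
create bitmap index fact_dim2_bix on fact (dim2) local;
create bitmap index fact_dim3_bix on fact (dim3) local;

With a layout along these lines, the two sample queries can prune down to the relevant monthly partitions and combine the bitmap indexes on dim1, dim2 and dim3 before touching any fact rows.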



  • Forget the cloud systems (Hive, Hbase) until they have mature SQL support. For a data warehouse application you want something that works with conventional reporting tools. Otherwise, you will find yourself perpetually bogged down writing and maintaining ad-hoc report programs.


The data volumes are manageable with a more conventional DBMS like Oracle - I know of a major European telco that loads 600GB per day into an Oracle database. All other things being equal, that's two orders of magnitude bigger than your data volumes, so shared disk architectures still have headroom for you. A shared-nothing architecture like Netezza or Teradata will probably be faster still but these volumes are not at a level that is beyond a conventional shared-disk system. Bear in mind, though, that these systems are all quite expensive.


Also bear in mind that MapReduce is not an efficient query selection algorithm. It is fundamentally a mechanism for distributing brute-force computations. Greenplum does have a MapReduce back-end, but a purpose-built shared-nothing engine will be a lot more efficient and get more work done for less hardware.


My take on this is that Teradata or Netezza would probably be the ideal tool for the job, but definitely the most expensive. Oracle, Sybase IQ or even SQL Server would also handle the data volumes involved, but will be slower - they are shared-disk architectures but can still manage this sort of data volume. See this posting for a rundown of VLDB-related features in Oracle and SQL Server, and bear in mind that Oracle has also just introduced the Exadata storage platform.


My back-of-a-fag-packet capacity plan suggests maybe 3-5 TB or so per month including indexes for Oracle or SQL Server. Probably less on Oracle with bitmap indexes, although an index leaf has a 16-byte ROWID on Oracle vs. a 6-byte page reference on SQL Server.
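As a rough illustration of where a figure like that can come from (the per-row size here is an assumption, not a measured number): 20B rows/month at roughly 150-250 bytes per row including index overhead works out to 20 × 10⁹ × 200 bytes ≈ 4 TB/month, squarely in the 3-5 TB range.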


Sybase IQ makes extensive use of bitmap indexes and is optimized for data warehouse queries. Although a shared-disk architecture, it is very efficient for this type of query (IIRC it was the original column-oriented architecture). This would probably be better than Oracle or SQL Server as it is specialized for this type of work.


Greenplum might be a cheaper option, but I've never actually used it so I can't comment on how well it works in practice.


If you have 10 dimensions with just a few hundred rows each, consider merging them into a single junk dimension, which will slim down your fact table by merging the ten keys into just one. You can still implement hierarchies on a junk dimension, and this would knock 1/2 or more off the size of your fact table and eliminate a lot of disk usage by indexes.
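A minimal sketch of what that could look like (staged_fact, dim_a, dim_b and junk_dim are hypothetical names): build the junk dimension from the key combinations that actually occur in the data, then have the fact table store the single junk_key instead of ten keys.

-- Hypothetical sketch: collapse ten small dimension keys into one junk
-- dimension, keyed only on combinations that actually occur in the data.
create table junk_dim as
select row_number() over (order by dim_a, dim_b) as junk_key,
       dim_a, dim_b                   -- ... plus the other eight small keys
from  (select distinct dim_a, dim_b   -- ... and the other eight
       from   staged_fact);

-- The fact table then carries junk_key in place of the ten original keys;
-- queries join back through junk_dim to recover the small-dimension
-- attributes and hierarchies.

Building it from distinct observed combinations (rather than a full cross product of the ten dimensions) keeps the junk dimension small even though the theoretical combination space is enormous.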


I strongly recommend that you go with something that plays nicely with a reasonable cross-section of reporting tools. This means a SQL front end. Commercial systems like Crystal Reports allow reporting and analytics to be done by people with a more readily obtainable set of SQL skills. The open-source world has also produced BIRT, Jasper Reports and Pentaho. Hive or HBase put you in the business of building a custom front-end, which you really don't want unless you're happy to spend the next 5 years writing custom report formatters in Python.


Finally, host it somewhere you can easily get a fast data feed from your production systems. This probably means your own hardware in your own data centre. This system will be I/O bound; it's doing simple processing on large volumes of data. This means you will need machines with fast disk subsystems. Cloud providers tend not to support this type of hardware as it's an order of magnitude more expensive than the type of disposable 1U box traditionally used by these outfits. Fast disk I/O is not a strength of cloud architectures.

