20 Billion Rows/Month - Hbase / Hive / Greenplum / What?

Question

I'd like to use your wisdom to pick the right solution for a data-warehouse system. Here are some details to better understand the problem:

Data is organized in a star schema structure with one BIG fact table and ~15 dimensions:
20B fact rows per month
10 dimensions with a few hundred rows each (somewhat hierarchical)
5 dimensions with thousands of rows
2 dimensions with ~200K rows
2 big dimensions with 50M-100M rows

Two typical queries run against this DB

Top members of dimq:

select    top X dimq, count(id) 
from      fact 
where     dim1 = x and dim2 = y and dim3 = z 
group by  dimq 
order by  count(id) desc

Measures against a tuple:

select    count(distinct dis1), count(distinct dis2), count(dim1), count(dim2),...
from      fact 
where     dim1 = x and dim2 = y and dim3 = z 

Questions:

  1. What is the best platform to perform such queries?
  2. What kind of hardware is needed?
  3. Where can it be hosted (EC2?)


(please ignore importing and loading issues at the moment)

Tnx,
Haggai.

Answer

I cannot stress this enough: Get something that plays nicely with off-the-shelf reporting tools.

20 Billion rows per month puts you in VLDB territory, so you need partitioning. The low cardinality dimensions would also suggest that bitmap indexes would be a performance win.
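
To make that concrete, here is a minimal Oracle-flavored sketch of the idea (illustrative only, not a tuned design - the column names follow the queries above, and the partition names and bounds are made up): range-partition the fact table by month and put bitmap indexes on the low-cardinality dimension keys.

create table fact (
  fact_date  date    not null,
  dim1       number  not null,
  dim2       number  not null,
  dim3       number  not null,
  dimq       number  not null,
  id         number  not null
)
partition by range (fact_date) (
  partition p200901 values less than (date '2009-02-01'),
  partition p200902 values less than (date '2009-03-01')
);

-- local bitmap indexes on the low-cardinality dimension keys
create bitmap index fact_dim1_bx on fact (dim1) local;
create bitmap index fact_dim2_bx on fact (dim2) local;

Partition pruning then keeps a month's queries on a month's data, and the optimizer can AND the bitmap indexes together for the multi-dimension predicates in the queries above.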

  • Forget the cloud systems (Hive, Hbase) until they have mature SQL support. For a data warehouse application you want something that works with conventional reporting tools. Otherwise, you will find yourself perpetually bogged down writing and maintaining ad-hoc report programs.

The data volumes are manageable with a more conventional DBMS like Oracle - I know of a major European telco that loads 600GB per day into an Oracle database. All other things being equal, that's several times your projected volume, so shared-disk architectures still have headroom for you. A shared-nothing architecture like Netezza or Teradata will probably be faster still, but these volumes are not at a level that is beyond a conventional shared-disk system. Bear in mind, though, that these systems are all quite expensive.

Also bear in mind that MapReduce is not an efficient query selection algorithm. It is fundamentally a mechanism for distributing brute-force computations. Greenplum does have a MapReduce back-end, but a purpose-built shared-nothing engine will be a lot more efficient and get more work done for less hardware.

My take on this is that Teradata or Netezza would probably be the ideal tool for the job, but definitely the most expensive. Oracle, Sybase IQ or even SQL Server would also handle the data volumes involved, but will be slower - they are shared-disk architectures, but can still manage this sort of data volume. See this posting for a rundown on VLDB-related features in Oracle and SQL Server, and bear in mind that Oracle has just introduced the Exadata storage platform as well.

My back-of-a-fag-packet capacity plan suggests maybe 3-5 TB or so per month, including indexes, for Oracle or SQL Server. Probably less on Oracle with bitmap indexes, although an index leaf has a 16-byte ROWID on Oracle vs. a 6-byte page reference on SQL Server.
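
To show where a figure like that comes from (a rough calculation with an assumed row width, not a measured one): at roughly 150-250 bytes per fact row including index overhead,

20,000,000,000 rows/month × 200 bytes/row ≈ 4 TB/month

which lands in the middle of the 3-5 TB range above.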

Sybase IQ makes extensive use of bitmap indexes and is optimized for data warehouse queries. Although a shared-disk architecture, it is very efficient for this type of query (IIRC it was the original column-oriented architecture). This would probably be better than Oracle or SQL Server as it is specialized for this type of work.

Greenplum might be a cheaper option but I've never actually used it so I can't comment on how well it works in practice.

If you have 10 dimensions with just a few hundred rows each, consider merging them into a single junk dimension, which will slim down your fact table by merging the ten keys into just one. You can still implement hierarchies on a junk dimension, and this would knock 1/2 or more off the size of your fact table and eliminate a lot of disk usage by indexes.
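
A minimal sketch of the idea (all names hypothetical, and it assumes the fact table is re-keyed to carry a single junk_key):

-- one row per observed combination of the small dimensions' attributes
create table junk_dim (
  junk_key  number primary key,
  attr_a    varchar2(30),  -- attribute from one former small dimension
  attr_b    varchar2(30),  -- attribute from another
  attr_c    varchar2(30)   -- ...and so on for the rest
);

-- queries then constrain or group on the junk dimension's attributes
select    j.attr_a, count(f.id)
from      fact f
join      junk_dim j on j.junk_key = f.junk_key
group by  j.attr_a;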

I strongly recommend that you go with something that plays nicely with a reasonable cross-section of reporting tools. This means a SQL front end. Commercial systems like Crystal Reports allow reporting and analytics to be done by people with a more readily obtainable set of SQL skills. The open-source world has also generated BIRT, Jasper Reports and Pentaho. Hive or HBase puts you in the business of building a custom front-end, which you really don't want unless you're happy to spend the next 5 years writing custom report formatters in Python.

Finally, host it somewhere you can easily get a fast data feed from your production systems. This probably means your own hardware in your own data centre. This system will be I/O bound; it's doing simple processing on large volumes of data. This means you will need machines with fast disk subsystems. Cloud providers tend not to support this type of hardware, as it's an order of magnitude more expensive than the type of disposable 1U box traditionally used by these outfits. Fast disk I/O is not a strength of cloud architectures.
