MapReduce & RDBMS

Question

I was reading Hadoop: The Definitive Guide. It says that MapReduce is good for updating larger portions of a database, and that it uses Sort/Merge to rebuild the database, which depends on transfer time.

It also says that an RDBMS is good for updating only smaller portions of a big database, and that it uses a B-Tree, which is limited by seek time.

Can anyone elaborate on what both of these claims really mean?

Recommended answer

I am not really sure what the book means, but if you still have the raw data, you will usually run a MapReduce job to rebuild the entire database (or whatever else you are building).

The really good thing about Hadoop is that it is distributed, so performance is usually less of a problem: you can add more machines.

Let's take an example: you need to rebuild a complex table with 1 billion rows. With a traditional RDBMS you mostly scale vertically, so you depend more on the power of the CPU and on how fast the algorithm is. You will be doing it with some SQL commands: select some data, process it, write it back, and so on. So you will most likely be limited by seek time.
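
To make the seek-time versus transfer-time distinction concrete, here is a rough back-of-the-envelope sketch. All the numbers are assumptions chosen for illustration (10 ms per random seek, 100 MB/s sustained transfer, 100-byte rows, one seek per updated row); they are not taken from the book, and the model ignores caching, B-Tree depth, and parallel disks.

```python
# Rough model: updating rows in place costs one random seek per row,
# while a rebuild streams the whole table once in and once out.
# Every constant below is an illustrative assumption.

def seek_bound_seconds(n_updates, seek_ms=10.0):
    """Time if each updated row costs one random disk seek."""
    return n_updates * seek_ms / 1000.0

def transfer_bound_seconds(total_bytes, mb_per_s=100.0):
    """Time to stream total_bytes once at a sustained transfer rate."""
    return total_bytes / (mb_per_s * 1_000_000)

TOTAL_ROWS = 1_000_000_000          # the 1 billion rows from the example
ROW_BYTES = 100                     # assumed row size
total_bytes = TOTAL_ROWS * ROW_BYTES

# Full rewrite: read the table once and write it once.
rewrite_seconds = 2 * transfer_bound_seconds(total_bytes)

if __name__ == "__main__":
    print(f"full rewrite: {rewrite_seconds:,.0f} s")
    for updates in (10_000, 10_000_000, TOTAL_ROWS):
        in_place = seek_bound_seconds(updates)
        print(f"{updates:>13,} in-place updates: {in_place:,.0f} s")
```

Under these assumptions, updating a few thousand rows in place is far cheaper than rewriting everything, but once a noticeable fraction of the table changes, the per-row seeks add up to more than streaming the whole table. That seems to be the trade-off the book is pointing at.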

With Hadoop MapReduce, you can add more machines, so performance is less of a problem. Let's say you use 10000 mappers: the task will be divided among 10000 mapper containers and, because of the way Hadoop works, these containers usually already have their input data stored locally on their hard drive. The output of each mapper is always written to its local hard drive in a key-value format, and the mapper sorts this data by key.

Now the problem is that the data needs to be combined, so all of it is sent to the reducers. This happens over the network and is usually the slowest part if you have big data. Each reducer receives its share of the data and merge-sorts it for further processing. In the end you have a file that can simply be uploaded to your database.
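
The map, sort, shuffle, and merge steps described above can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop's API: the word-count job, the function names, and the CRC-based partitioner are all assumptions made for illustration, and the "network transfer" is just a list being passed around.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

def map_phase(split):
    """Emit (key, value) pairs for one input split, sorted by key."""
    pairs = [(word, 1) for line in split for word in line.split()]
    return sorted(pairs, key=itemgetter(0))

def partition(key, n_reducers):
    """Pick the reducer for a key (deterministic hash, hypothetical choice)."""
    return zlib.crc32(key.encode("utf-8")) % n_reducers

def shuffle(mapper_outputs, n_reducers):
    """Send each mapper's sorted run to the reducers, one sorted run per pair."""
    inboxes = [[] for _ in range(n_reducers)]
    for run in mapper_outputs:
        parts = [[] for _ in range(n_reducers)]
        for key, value in run:
            parts[partition(key, n_reducers)].append((key, value))
        for reducer_id, part in enumerate(parts):
            if part:  # a slice of a sorted run is still sorted
                inboxes[reducer_id].append(part)
    return inboxes

def reduce_phase(sorted_runs):
    """Merge the already-sorted runs, then sum the values for each key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [(key, sum(value for _, value in group))
            for key, group in groupby(merged, key=itemgetter(0))]

def run_job(splits, n_reducers=2):
    """Run map, shuffle, and reduce over the splits; return sorted results."""
    mapper_outputs = [map_phase(split) for split in splits]
    results = []
    for runs in shuffle(mapper_outputs, n_reducers):
        results.extend(reduce_phase(runs))
    return sorted(results)

if __name__ == "__main__":
    print(run_job([["a b a"], ["b c"]]))
```

Because every mapper hands over runs that are already sorted, the reducer only has to merge them, which is sequential work. In a real cluster the `shuffle` step is where the data crosses the network.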

If you have a lot of data, the transfer from mappers to reducers is usually what takes the longest, and the network is usually your bottleneck. Maybe this is what the book meant by depending on transfer time.
