好的 MapReduce 示例 [英] Good MapReduce examples

查看:36
本文介绍了好的 MapReduce 示例的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

除了如何使用 MapReduce 计算长文本中的单词"任务之外,我想不出任何好的示例.我发现这并不是让其他人了解此工具的强大功能的最佳示例.

I couldn't think of any good examples other than the "how to count words in a long text with MapReduce" task. I found this wasn't the best example to give others an impression of how powerful this tool can be.

我不是在寻找代码片段,实际上只是文本"示例.

I'm not looking for code-snippets, really just "textual" examples.

推荐答案

Map reduce 是一个为高效处理海量数据而开发的框架.例如,如果我们在数据集中有 100 万条记录,并且它以关系表示形式存储 - 派生值并对这些值执行任何类型的转换都是非常昂贵的.

Map reduce is a framework that was developed to process massive amounts of data efficiently. For example, if we have 1 million records in a dataset, and it is stored in a relational representation - it is very expensive to derive values and perform any sort of transformations on these.

例如在 SQL 中,在给定出生日期的情况下,要找出有多少人的年龄 > 30 岁以获得一百万条记录将需要一段时间,而且当查询的复杂性增加时,这只会按大小顺序增加.Map Reduce 提供基于集群的实现,其中数据以分布式方式处理

For Example In SQL, Given the Date of Birth, to find out How many people are of age > 30 for a million records would take a while, and this would only increase in order of magnitute when the complexity of the query increases. Map Reduce provides a cluster based implementation where data is processed in a distributed manner

这是一篇解释 map-reduce 的维基百科文章一个>

Here is a wikipedia article explaining what map-reduce is all about

另一个很好的例子是通过 map reduce 寻找朋友可以成为理解这个概念的有力例子,并且一个很好用的用例.

Another good example is Finding Friends via map reduce can be a powerful example to understand the concept, and a well used use-case.

个人觉得这个链接对理解这个概念很有帮助

Personally, found this link quite useful to understand the concept

复制博客中提供的解释(以防链接失效)

MapReduce 是最初由 Google 开发的框架,它允许用于跨多个域的轻松大规模分布式计算.Apache Hadoop 是一个开源实现.

Finding Friends

MapReduce is a framework originally developed at Google that allows for easy large scale distributed computing across a number of domains. Apache Hadoop is an open source implementation.

我将掩盖细节,但归结为定义两个函数:一个map函数和一个reduce函数.地图功能接受一个值并输出键:值对.例如,如果我们定义一个 map 函数,它接受一个字符串并输出单词的长度作为键和单词本身作为值然后 map(steve) 将返回 5:steve 和 map(savannah) 将返回 8:savannah.你可能有注意到 map 函数是无状态的,只需要输入value 来计算它的输出值.这允许我们运行地图并行地对值起作用,并提供了巨大的优势.在我们进入reduce函数之前,mapreduce框架组通过键将所有值组合在一起,因此如果映射函数输出以下键:值对:

I'll gloss over the details, but it comes down to defining two functions: a map function and a reduce function. The map function takes a value and outputs key:value pairs. For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value then map(steve) would return 5:steve and map(savannah) would return 8:savannah. You may have noticed that the map function is stateless and only requires the input value to compute it's output value. This allows us to run the map function against values in parallel and provides a huge advantage. Before we get to the reduce function, the mapreduce framework groups all of the values together by key, so if the map functions output the following key:value pairs:

3 : the
3 : and
3 : you
4 : then
4 : what
4 : when
5 : steve
5 : where
8 : savannah
8 : research

它们被分组为:

3 : [the, and, you]
4 : [then, what, when]
5 : [steve, where]
8 : [savannah, research]

然后这些行中的每一行都将作为参数传递给 reduce函数,它接受一个键和一个值列表.在这种情况下,我们可能试图弄清楚有多少特定长度的单词存在,所以我们的reduce函数只会计算项目的数量列表并输出具有列表大小的键,例如:

Each of these lines would then be passed as an argument to the reduce function, which accepts a key and a list of values. In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:

3 : 3
4 : 3
5 : 2
8 : 2

减少也可以并行完成,再次提供巨大的优势.然后我们可以查看这些最终结果并看到在我们的语料库中只有两个长度为 5 的单词,等等......

The reductions can also be done in parallel, again providing a huge advantage. We can then look at these final results and see that there were only two words of length 5 in our corpus, etc...

mapreduce 最常见的例子是计算词在语料库中出现的次数.假设你有一份互联网副本(我很幸运能够在这种情况下工作),并且你想要一个互联网上每个单词的列表以及有多少它发生的次数.

The most common example of mapreduce is for counting the number of times words occur in a corpus. Suppose you had a copy of the internet (I've been fortunate enough to have worked in such a situation), and you wanted a list of every word on the internet as well as how many times it occurred.

您处理此问题的方法是将您的文档标记化拥有(将其分解为单词),并将每个单词传递给映射器.映射器然后会将单词连同 1 的值一起吐出.这分组阶段将获取所有键(在本例中为单词),并制作一个1 的列表.然后,reduce 阶段需要一个键(单词)和一个列表(每次密钥出现在互联网上时的 1 列表),以及总结列表.然后 reducer 输出这个词,连同它的数数.当一切都说完了,你会得到一个每个单词的列表互联网,以及它出现的次数.

The way you would approach this would be to tokenize the documents you have (break it into words), and pass each word to a mapper. The mapper would then spit the word back out along with a value of 1. The grouping phase will take all the keys (in this case words), and make a list of 1's. The reduce phase then takes a key (the word) and a list (a list of 1's for every time the key appeared on the internet), and sums the list. The reducer then outputs the word, along with it's count. When all is said and done you'll have a list of every word on the internet, along with how many times it appeared.

很简单,对吧?如果你曾经读过 mapreduce,上面的场景不是什么新东西……它是 mapreduce 的Hello, World".所以这里是一个真实世界的用例(Facebook 可能会也可能不会真正做到以下,只是一个例子):

Easy, right? If you've ever read about mapreduce, the above scenario isn't anything new... it's the "Hello, World" of mapreduce. So here is a real world use case (Facebook may or may not actually do the following, it's just an example):

Facebook 有一个朋友列表(请注意,朋友是双向的脸书上的事.如果我是你的朋友,你就是我的).他们还有大量磁盘空间,它们服务于数亿个请求每天.他们决定在可能的情况下预先计算减少请求的处理时间.一种常见的处理请求是你和乔有 230 个共同的朋友"功能.当你访问某人的个人资料,您会看到您拥有的朋友列表常见的.此列表不会经常更改,因此很浪费每次访问配置文件时重新计算它(确保您可以使用一个不错的缓存策略,但是我将无法继续为这个问题写关于 mapreduce 的文章).我们将使用mapreduce 这样我们就可以计算出每个人的共同朋友一次天并存储这些结果.稍后它只是一个快速查找.我们已经有很多磁盘,很便宜.

Facebook has a list of friends (note that friends are a bi-directional thing on Facebook. If I'm your friend, you're mine). They also have lots of disk space and they serve hundreds of millions of requests everyday. They've decided to pre-compute calculations when they can to reduce the processing time of requests. One common processing request is the "You and Joe have 230 friends in common" feature. When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently so it'd be wasteful to recalculate it every time you visited the profile (sure you could use a decent caching strategy, but then I wouldn't be able to continue writing about mapreduce for this problem). We're going to use mapreduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk, it's cheap.

假设好友存储为 Person->[List of Friends],我们的那么朋友列表是:

Assume the friends are stored as Person->[List of Friends], our friends list is then:

A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D

每一行都是映射器的一个参数.送给身边的每一位朋友好友列表,映射器会输出一个键值对.关键将与此人成为朋友.该值将是列表朋友们.键将被排序,以便朋友按顺序排列,导致所有成对的朋友都去同一个减速器.这很难用文字解释,所以让我们这样做,看看你是否能看到图案.在所有映射器运行完毕后,您将拥有一个列表像这样:

Each line will be an argument to a mapper. For every friend in the list of friends, the mapper will output a key-value pair. The key will be a friend along with the person. The value will be the list of friends. The key will be sorted so that the friends are in order, causing all pairs of friends to go to the same reducer. This is hard to explain with text, so let's just do it and see if you can see the pattern. After all the mappers are done running, you'll have a list like this:

For map(A -> B C D) :

(A B) -> B C D
(A C) -> B C D
(A D) -> B C D

For map(B -> A C D E) : (Note that A comes before B in the key)

(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E
For map(C -> A B D E) :

(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E
For map(D -> A B C E) :

(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E
And finally for map(E -> B C D):

(B E) -> B C D
(C E) -> B C D
(D E) -> B C D
Before we send these key-value pairs to the reducers, we group them by their keys and get:

(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)

每一行都将作为参数传递给reducer.减少函数将简单地与值列表相交并输出相同的值键与交集的结果.例如,reduce((A B) ->(A C D E) (B C D)) 将输出 (A B) : (C D) 表示朋友 A和 B 有 C 和 D 作为共同的朋友.

Each line will be passed as an argument to a reducer. The reduce function will simply intersect the lists of values and output the same key with the result of the intersection. For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D) and means that friends A and B have C and D as common friends.

还原后的结果是:

(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)

现在当 D 访问 B 的个人资料时,我们可以快速查找 (B D) 并看到他们有三个共同的朋友,(A C E).

Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common, (A C E).

这篇关于好的 MapReduce 示例的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆