好的MapReduce示例 [英] Good MapReduce examples

查看:54
本文介绍了好的MapReduce示例的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

除了如何使用MapReduce计算长文本中的单词数"任务以外,我没有想到任何其他好的示例.我发现这并不是让其他人印象深刻的最好的例子.

I couldn't think of any good examples other than the "how to count words in a long text with MapReduce" task. I found this wasn't the best example to give others an impression of how powerful this tool can be.

我不是在寻找代码片段,实际上只是在寻找文本"示例.

I'm not looking for code-snippets, really just "textual" examples.

推荐答案

Map reduce是一个框架,旨在有效地处理大量数据. 例如,如果我们在数据集中有100万条记录,并且以关系表示形式存储,则导出值并对其进行任何类型的转换将非常昂贵.

Map reduce is a framework that was developed to process massive amounts of data efficiently. For example, if we have 1 million records in a dataset, and it is stored in a relational representation - it is very expensive to derive values and perform any sort of transformations on these.

例如,在SQL中,给定出生日期,要找出多少个年龄大于30的人有100万条记录,这将需要一段时间,并且仅当查询的复杂性增加时,这才按先后顺序增加. Map Reduce提供了一个基于集群的实现,其中数据以分布式方式进行处理

For Example In SQL, Given the Date of Birth, to find out How many people are of age > 30 for a million records would take a while, and this would only increase in order of magnitute when the complexity of the query increases. Map Reduce provides a cluster based implementation where data is processed in a distributed manner

这是维基百科的文章,解释了 map-reduce的全部内容

Here is a wikipedia article explaining what map-reduce is all about

另一个很好的例子是通过map reduce查找朋友可以成为理解概念的有力例子,并且 一个很好用的案例.

Another good example is Finding Friends via map reduce can be a powerful example to understand the concept, and a well used use-case.

个人认为此链接对于理解该概念非常有用

Personally, found this link quite useful to understand the concept

复制博客中提供的说明(以防链接过期)

寻找朋友

MapReduce是最初由Google开发的框架,该框架允许 轻松实现跨多个领域的大规模分布式计算. Apache Hadoop是一个开源实现.

Finding Friends

MapReduce is a framework originally developed at Google that allows for easy large scale distributed computing across a number of domains. Apache Hadoop is an open source implementation.

我会详细介绍细节,但是归结为定义两个 函数:映射函数和归约函数.地图功能 接受一个值并输出key:value对.例如,如果我们定义 一个映射函数,它接受一个字符串并输出单词的长度 作为键,单词本身作为值,然后map(steve)将 返回5:steve,而map(savannah)将返回8:savannah.你可能有 注意到map函数是无状态的,只需要输入 值以计算其输出值.这使我们可以运行地图 对价值并行运作,并提供了巨大的优势. 在使用reduce函数之前,mapreduce框架进行了分组 通过键将所有值一起计算,因此,如果map函数输出 以下key:value对:

I'll gloss over the details, but it comes down to defining two functions: a map function and a reduce function. The map function takes a value and outputs key:value pairs. For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value then map(steve) would return 5:steve and map(savannah) would return 8:savannah. You may have noticed that the map function is stateless and only requires the input value to compute it's output value. This allows us to run the map function against values in parallel and provides a huge advantage. Before we get to the reduce function, the mapreduce framework groups all of the values together by key, so if the map functions output the following key:value pairs:

3 : the
3 : and
3 : you
4 : then
4 : what
4 : when
5 : steve
5 : where
8 : savannah
8 : research

他们分组为:

3 : [the, and, you]
4 : [then, what, when]
5 : [steve, where]
8 : [savannah, research]

这些行中的每一行都将作为参数传递给reduce 函数,它接受一个键和一个值列表.在这种情况下, 我们可能会试图找出多少个特定长度的单词 存在,所以我们的reduce函数将只计算其中的项目数 列表并输出具有列表大小的密钥,例如:

Each of these lines would then be passed as an argument to the reduce function, which accepts a key and a list of values. In this instance, we might be trying to figure out how many words of certain lengths exist, so our reduce function will just count the number of items in the list and output the key with the size of the list, like:

3 : 3
4 : 3
5 : 2
8 : 2

减少也可以并行完成,再次提供了巨大的 优势.然后,我们可以查看这些最终结果并看到 在我们的语料库中只有两个长度为5的单词,等等...

The reductions can also be done in parallel, again providing a huge advantage. We can then look at these final results and see that there were only two words of length 5 in our corpus, etc...

最常见的mapreduce示例是计算 单词在语料库中出现的时间.假设您拥有互联网的副本 (我很幸运能够在这种情况下工作),以及 您想要列出互联网上每个单词的列表以及 发生的次数.

The most common example of mapreduce is for counting the number of times words occur in a corpus. Suppose you had a copy of the internet (I've been fortunate enough to have worked in such a situation), and you wanted a list of every word on the internet as well as how many times it occurred.

您将采用的方法是将您的文档标记化 拥有(将其分解为单词),并将每个单词传递给映射器.映射器 然后会将单词连同1值一起吐出.这 分组阶段将获取所有键(在本例中为单词),并进行 1的列表.然后,reduce阶段获取一个键(单词)和一个列表 (密钥每次出现在Internet上的1的列表),以及 汇总列表.然后,reducer输出单词,以及它的 数数.说完所有内容后,您将获得每个单词的清单 互联网以及它出现了多少次.

The way you would approach this would be to tokenize the documents you have (break it into words), and pass each word to a mapper. The mapper would then spit the word back out along with a value of 1. The grouping phase will take all the keys (in this case words), and make a list of 1's. The reduce phase then takes a key (the word) and a list (a list of 1's for every time the key appeared on the internet), and sums the list. The reducer then outputs the word, along with it's count. When all is said and done you'll have a list of every word on the internet, along with how many times it appeared.

容易,对吧?如果您曾经阅读过有关mapreduce的内容,那么以上情况 并不是什么新东西……它是mapreduce的"Hello,World".所以这是 真实的用例(Facebook可能会也可能不会这样做) 以下,仅是一个示例):

Easy, right? If you've ever read about mapreduce, the above scenario isn't anything new... it's the "Hello, World" of mapreduce. So here is a real world use case (Facebook may or may not actually do the following, it's just an example):

Facebook有一个朋友列表(请注意,朋友是双向的 在Facebook上的东西.如果我是您的朋友,您是我的.他们还有 大量的磁盘空间,它们可以处理数亿个请求 每天.他们决定在可能的情况下预先计算计算结果. 减少请求的处理时间.一个共同的处理请求 是您和Joe有230个共同的朋友"功能.当你 访问某人的个人资料,您会看到您所拥有的朋友列表 常见的.此列表不会经常更改,因此浪费时间 每次您访问个人资料时都要重新计算一次(确保可以使用 一个不错的缓存策略,但是那样我就无法继续 针对这个问题撰写有关mapreduce的文章).我们将使用 mapreduce,以便我们可以一次计算每个人的共同朋友 并存储这些结果.稍后只是快速查找.我们已经 有很多磁盘,很便宜.

Facebook has a list of friends (note that friends are a bi-directional thing on Facebook. If I'm your friend, you're mine). They also have lots of disk space and they serve hundreds of millions of requests everyday. They've decided to pre-compute calculations when they can to reduce the processing time of requests. One common processing request is the "You and Joe have 230 friends in common" feature. When you visit someone's profile, you see a list of friends that you have in common. This list doesn't change frequently so it'd be wasteful to recalculate it every time you visited the profile (sure you could use a decent caching strategy, but then I wouldn't be able to continue writing about mapreduce for this problem). We're going to use mapreduce so that we can calculate everyone's common friends once a day and store those results. Later on it's just a quick lookup. We've got lots of disk, it's cheap.

假设朋友存储为Person-> [朋友列表],我们的 朋友列表如下:

Assume the friends are stored as Person->[List of Friends], our friends list is then:

A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D

每行都是映射器的参数.对于中的每个朋友 在好友列表中,映射器将输出一个键值对.钥匙会 和这个人成为朋友.该值将是 朋友们.钥匙将进行排序,以便朋友井井有条, 使所有成对的朋友都使用同一个减速器.这很难 用文本来解释,所以让我们来做一下,看看是否可以看到 图案.在所有映射器完成运行之后,您将获得一个列表 像这样:

Each line will be an argument to a mapper. For every friend in the list of friends, the mapper will output a key-value pair. The key will be a friend along with the person. The value will be the list of friends. The key will be sorted so that the friends are in order, causing all pairs of friends to go to the same reducer. This is hard to explain with text, so let's just do it and see if you can see the pattern. After all the mappers are done running, you'll have a list like this:

For map(A -> B C D) :

(A B) -> B C D
(A C) -> B C D
(A D) -> B C D

For map(B -> A C D E) : (Note that A comes before B in the key)

(A B) -> A C D E
(B C) -> A C D E
(B D) -> A C D E
(B E) -> A C D E
For map(C -> A B D E) :

(A C) -> A B D E
(B C) -> A B D E
(C D) -> A B D E
(C E) -> A B D E
For map(D -> A B C E) :

(A D) -> A B C E
(B D) -> A B C E
(C D) -> A B C E
(D E) -> A B C E
And finally for map(E -> B C D):

(B E) -> B C D
(C E) -> B C D
(D E) -> B C D
Before we send these key-value pairs to the reducers, we group them by their keys and get:

(A B) -> (A C D E) (B C D)
(A C) -> (A B D E) (B C D)
(A D) -> (A B C E) (B C D)
(B C) -> (A B D E) (A C D E)
(B D) -> (A B C E) (A C D E)
(B E) -> (A C D E) (B C D)
(C D) -> (A B C E) (A B D E)
(C E) -> (A B D E) (B C D)
(D E) -> (A B C E) (B C D)

每行将作为参数传递给reducer.减少 函数将简单地将值列表相交并输出相同的值 键与交集的结果.例如,reduce((A B)-> (A C D E)(B C D))将输出(A B):(C D)并表示朋友A 和B拥有C和D作为共同的朋友.

Each line will be passed as an argument to a reducer. The reduce function will simply intersect the lists of values and output the same key with the result of the intersection. For example, reduce((A B) -> (A C D E) (B C D)) will output (A B) : (C D) and means that friends A and B have C and D as common friends.

还原后的结果是:

(A B) -> (C D)
(A C) -> (B D)
(A D) -> (B C)
(B C) -> (A D E)
(B D) -> (A C E)
(B E) -> (C D)
(C D) -> (A B E)
(C E) -> (B D)
(D E) -> (B C)

现在,当D访问B的个人资料时,我们可以快速查找(B D)并查看 他们有三个共同的朋友(A C E).

Now when D visits B's profile, we can quickly look up (B D) and see that they have three friends in common, (A C E).

这篇关于好的MapReduce示例的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆