Should I learn/use MapReduce, or some other type of parallelization for this task?


Problem Description

After talking with a friend of mine from Google, I'd like to implement some kind of Job/Worker model for updating my dataset.

This dataset mirrors a 3rd party service's data, so, to do the update, I need to make several remote calls to their API. I think a lot of time will be spent waiting for responses from this 3rd party service. I'd like to speed things up, and make better use of my compute hours, by parallelizing these requests and keeping many of them open at once, as they wait for their individual responses.
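
As a rough illustration of that idea (not part of the original question), here is a minimal Python sketch of keeping many slow requests open at once with a thread pool instead of waiting on each response serially; the URLs are placeholders, not the real 3rd-party API:

```python
# Minimal sketch: overlap many blocking HTTP requests with a thread pool.
# The example URLs are placeholders for the real 3rd-party service.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url):
    """One blocking request; almost all of its time is spent waiting on the network."""
    with urlopen(url) as resp:
        return url, resp.read()

def fetch_many(urls, max_in_flight=50):
    """Issue up to max_in_flight requests concurrently and collect the responses."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return dict(pool.map(fetch, urls))

# e.g. fetch_many([f"https://api.example.com/users/{u}/favorites" for u in user_ids])
```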

Before I explain my specific dataset and get into the problem, I'd like to clarify what answers I'm looking for:

  1. Is this a flow that would be well suited to parallelizing with MapReduce?
  2. If yes, would this be cost-effective to run on Amazon's MapReduce module, which bills by the hour and rounds hours up when the job is complete? (I'm not sure exactly what counts as a "Job," so I don't know exactly how I'll be billed.)
  3. If no, is there another system/pattern I should use, and is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
  4. Are there any problems you see with the way I've designed this job flow?

OK, now for the details:

The dataset consists of users who have favorite items and who follow other users. The aim is to be able to update each user's queue -- the list of items the user will see when they load the page, based on the favorite items of the users she follows. But, before I can crunch the data and update a user's queue, I need to make sure I have the most up-to-date data, which is where the API calls come in.
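
To make that relationship concrete, here is a tiny illustrative example; the field names and IDs are invented, not taken from the actual dataset:

```python
# Rough shape of the mirrored data; every name and field here is made up.
users = {
    "alice": {"follows": ["bob", "carol"], "favorites": ["item-1"]},
    "bob":   {"follows": [],               "favorites": ["item-2", "item-3"]},
    "carol": {"follows": ["bob"],          "favorites": ["item-4"]},
}

# alice's queue is built from the favorites of everyone alice follows.
queue_for_alice = [item
                   for followed in users["alice"]["follows"]
                   for item in users[followed]["favorites"]]
print(queue_for_alice)  # ['item-2', 'item-3', 'item-4']
```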

There are two calls I can make:

  • Get Followed Users -- Which returns all the users being followed by the requested user, and
  • Get Favorite Items -- Which returns all the favorite items of the requested user (a minimal client sketch for both calls follows this list).
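
A minimal sketch of how those two calls might be wrapped as Python functions; the URL paths and JSON shapes are assumptions, since the actual service isn't named in the question:

```python
# Hypothetical wrappers for the two 3rd-party calls; paths and response
# formats are assumptions, not the real API.
import json
from urllib.request import urlopen

API_BASE = "https://api.example.com"  # placeholder for the 3rd-party service

def get_followed_users(user_id):
    """Return the ids of every user that `user_id` follows."""
    with urlopen(f"{API_BASE}/users/{user_id}/followed") as resp:
        return json.load(resp)

def get_favorite_items(user_id):
    """Return the favorite items of `user_id`."""
    with urlopen(f"{API_BASE}/users/{user_id}/favorites") as resp:
        return json.load(resp)
```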

After I call get followed users for the user being updated, I need to update the favorite items for each user being followed. Only when all of the favorites are returned for all the users being followed can I start processing the queue for that original user. This flow looks like:

The jobs in this flow are (a rough code sketch follows the list):

  • Start Updating Queue for user -- kicks off the process by fetching the users followed by the user being updated, storing them, and then creating Get Favorites jobs for each user.
  • Get Favorites for user -- Requests, and stores, a list of favorites for the specified user, from the 3rd party service.
  • Calculate New Queue for user -- Processes a new queue, now that all the data has been fetched, and then stores the results in a cache which is used by the application layer.
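
The sketch referred to above is below. It only shows the shape of the fan-out/fan-in: `api`, `db`, and `cache` are placeholder objects standing in for the 3rd-party client, the datastore, and the application cache, none of which are specified here.

```python
# Shape of the three jobs, run in-process for illustration. `api`, `db`, and
# `cache` are placeholders for the real service client, datastore, and cache.
from concurrent.futures import ThreadPoolExecutor

def start_updating_queue(user_id, api, db, cache, max_in_flight=50):
    """Job 1: fetch and store who the user follows, then fan out one
    Get Favorites job per followed user."""
    followed = api.get_followed_users(user_id)
    db.store_followed(user_id, followed)

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        list(pool.map(lambda uid: get_favorites(uid, api, db), followed))

    # Fan-in: only after every favorites list is stored do we build the queue.
    calculate_new_queue(user_id, followed, db, cache)

def get_favorites(user_id, api, db):
    """Job 2: request and store the favorites list for one followed user."""
    db.store_favorites(user_id, api.get_favorite_items(user_id))

def calculate_new_queue(user_id, followed, db, cache):
    """Job 3: crunch the stored data into the user's new queue and cache it."""
    items = [item for uid in followed for item in db.load_favorites(uid)]
    cache.set(f"queue:{user_id}", items)
```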

So, again, my questions are:

  1. Is this a flow that would be well suited to parallelizing with MapReduce? I don't know if it would let me start the process for UserX, fetch all the related data, and come back to processing UserX's queue only after that's all done.
  2. If yes, would this be cost-effective to run on Amazon's MapReduce module, which bills by the hour and rounds hours up when the job is complete? Is there a limit on how many "threads" I can have waiting on open API requests if I use their module?
  3. If no, is there another system/pattern I should use, and is there a library that will help me do this in Python (on AWS, using EC2 + EBS)?
  4. Are there any problems you see with the way I've designed this job flow?

Thanks for reading, I'm looking forward to some discussion with you all.

Edit, in response to JimR:

Thanks for a solid reply. In my reading since I wrote the original question, I've leaned away from using MapReduce. I haven't decided for sure yet how I want to build this, but I'm beginning to feel MapReduce is better for distributing / parallelizing computing load when I'm really just looking to parallelize HTTP requests.

What would have been my "reduce" task, the part that takes all the fetched data and crunches it into results, isn't that computationally intensive. I'm pretty sure it's going to wind up being one big SQL query that executes for a second or two per user.
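
As a guess at what that per-user query could look like (every table and column name below is invented, since the post never describes the schema):

```python
# A guess at the per-user "reduce" query; the schema is an assumption.
REBUILD_QUEUE_SQL = """
INSERT INTO queues (user_id, item_id)
SELECT fo.follower_id, fav.item_id
FROM follows   AS fo
JOIN favorites AS fav ON fav.user_id = fo.followed_id
WHERE fo.follower_id = %(user_id)s;
"""
```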

So, what I'm leaning towards is:

  • A non-MapReduce Job/Worker model, written in Python. A Google friend of mine turned me on to learning Python for this, since it's low overhead and scales well.
  • Using Amazon EC2 as a compute layer. I think this means I also need an EBS slice to store my database.
  • Possibly using Amazon's Simple Queue Service (SQS). It sounds like this third Amazon widget is designed to keep track of job queues, move results from one task into the inputs of another, and gracefully handle failed tasks. It's very cheap. It may be worth using instead of a custom job-queue system (a rough sketch follows this list).
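
The sketch mentioned in the last bullet: a minimal producer/worker pair on top of SQS, written with boto3 (the current AWS SDK for Python, which postdates this question). The queue name and message format are made up for illustration.

```python
# Minimal SQS-backed job queue sketch using boto3. The queue name and message
# schema are assumptions, not part of the original design.
import json
import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="queue-update-jobs")  # assumed name

def enqueue_favorites_jobs(followed_user_ids):
    """Producer: one Get Favorites job per followed user."""
    for uid in followed_user_ids:
        queue.send_message(MessageBody=json.dumps({"job": "get_favorites",
                                                   "user_id": uid}))

def worker(handle_job):
    """Worker loop: pull jobs, process them, and delete only on success so a
    failed job reappears after the visibility timeout (SQS's retry mechanism)."""
    while True:
        for msg in queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=20):
            try:
                handle_job(json.loads(msg.body))
                msg.delete()
            except Exception:
                pass  # leave the message; SQS will redeliver it later
```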

Recommended Answer

Seems that we're going with Node.js and the Seq flow control library. It was very easy to move from my map/flowchart of the process to a stub of the code, and now it's just a matter of filling out the code to hook into the right APIs.

Thanks for the answers, they were a lot of help finding the solution I was looking for.
