Load Spark data into Mongo/Memcached for use by a web service

Problem Description

I am extremely new to Spark and have a specific workflow-related question. It is not really a coding question; it is more a question about Spark functionality, but I thought it would be appropriate here. Please feel free to redirect me to the correct site if you think this question is inappropriate for SO.

So here goes:

1. I am planning to consume a stream of requests using Spark's sliding-window functionality and calculate a recommendation model. Once the model is calculated, would it be possible for a web service to query and consume this data directly from an RDD? If so, could anyone point me toward some example code of how this can be achieved? (A sketch of the setup I have in mind follows this list.)

2. If not, I would like to store the data in memcached, since the data is not very large yet; I am using Spark mainly for its in-memory iterative computation and streaming support. Would it be possible to load the RDD data into memcached? I ask because I could only find a Mongo connector for Spark, and not a memcached one.
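
For concreteness, here is a minimal sketch of the setup I have in mind, assuming Spark Streaming; the socket source, the batch/window durations, and the count() stand-in are placeholders of mine, not real recommendation code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object RecommenderStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RecommenderStream")
    // New micro-batch every 10 seconds (placeholder interval).
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source; in practice this would be Kafka, Kinesis, etc.
    val requests = ssc.socketTextStream("localhost", 9999)

    // Recompute over the last 5 minutes of requests, sliding every 30 seconds.
    val windowed = requests.window(Minutes(5), Seconds(30))

    windowed.foreachRDD { rdd =>
      // The recommendation-model computation would go here; count() is a
      // stand-in so the sketch runs as written.
      val count = rdd.count()
      println(s"window contains $count requests")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```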

Any help, and especially specific code examples or links, would be much appreciated.

Thanks.

Recommended Answer

You can't query an RDD directly in this way. Think of your Spark job as a stream processor. What you can do is push the updated model to some "store", such as a database (with a custom API or JDBC), a file system, or memcached. You could even make a web service call from within the Spark code.
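
As a rough sketch of the memcached variant, assuming the spymemcached client (net.spy.memcached) and a model stream of (userId, recommendations) pairs; the host, port, and key scheme are illustrative assumptions:

```scala
import java.net.InetSocketAddress
import net.spy.memcached.MemcachedClient
import org.apache.spark.streaming.dstream.DStream

// Push each (userId, recommendations) pair of the updated model to memcached.
def pushModel(model: DStream[(String, Seq[String])]): Unit = {
  model.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // The client is not serializable, so create it on the executor, once
      // per partition, rather than shipping it from the driver.
      val client =
        new MemcachedClient(new InetSocketAddress("memcached-host", 11211))
      try {
        records.foreach { case (userId, recs) =>
          // set(key, expirySeconds, value); 0 means the entry never expires.
          client.set(s"recs:$userId", 0, recs.mkString(","))
        }
      } finally {
        client.shutdown()
      }
    }
  }
}
```

The web service then reads the recs:<userId> keys straight from memcached and never has to touch the Spark job itself.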

Whatever you do, be careful that the time to process each batch of data, including I/O, stays well under the interval time you specify. Otherwise, you risk a growing backlog of unprocessed batches that can eventually crash the job.
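
One way to keep an eye on this, assuming Spark Streaming's StreamingListener API, is to log whenever a batch overruns its interval:

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Warn whenever a batch's processing time exceeds the batch interval.
class BatchTimeWatcher(batchIntervalMs: Long) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // processingDelay is the batch's processing time in milliseconds.
    batch.batchInfo.processingDelay.foreach { ms =>
      if (ms > batchIntervalMs)
        println(s"WARN: batch took $ms ms, over the $batchIntervalMs ms interval")
    }
  }
}

// Registered on the StreamingContext before ssc.start():
// ssc.addStreamingListener(new BatchTimeWatcher(10000))
```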

One other thing to watch for is the case where your model data lives in more than one RDD partition spread over the cluster (which is the default, of course). If the order of your "records" doesn't matter, then writing them out in parallel is fine. If you need a specific total order written out sequentially (and the data really isn't large), call collect to bring them into one in-memory data structure inside your driver code (which will mean network traffic in a distributed job), then write from there.
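
A sketch of that collect-then-write path, assuming the model fits comfortably in driver memory (the sort key and the CSV output format are illustrative):

```scala
import java.io.PrintWriter
import org.apache.spark.rdd.RDD

// Impose a total order, pull everything to the driver, write sequentially.
def writeOrdered(model: RDD[(String, Double)], path: String): Unit = {
  // sortBy imposes the total order; collect() then moves all partitions to
  // the driver over the network, so this only works for modestly sized data.
  val ordered = model.sortBy(_._2, ascending = false).collect()
  val out = new PrintWriter(path)
  try {
    ordered.foreach { case (item, score) => out.println(s"$item,$score") }
  } finally {
    out.close()
  }
}
```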
