Using Spark to process requests


Problem Description

I would like to understand if the following would be a correct use case for Spark.

Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.

Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a rules engine. Once these are completed, a new message is sent to a downstream system.

We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.

I am envisaging that it would work like this:

  1. A batch of requests is loaded into Spark as an RDD (requests received on the message queue would probably use Spark Streaming).
  2. Separate Scala functions would be written for filtering, validation, reference data lookup and the calculations.
  3. The first function would be passed the RDD and would return a new RDD.
  4. The next function would then run against the previous function's RDD output.
  5. Once all the functions have completed, a for comprehension would be run against the final RDD to send each modified request to the downstream system (a sketch follows this list).
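
A minimal sketch of that chain of transformations in Scala. The `Request` case class, the stage functions and the file source are all hypothetical placeholders, not part of any existing codebase; the point is only the shape of the pipeline, where every step maps one RDD to the next and a final action pushes results out:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical request type; real fields depend on the application.
case class Request(id: String, payload: String)

object RequestPipeline {

  // Stub stages, each a plain Scala function so it composes as an RDD transformation.
  def isRelevant(r: Request): Boolean = r.payload.nonEmpty // filtering
  def validate(r: Request): Request = r                    // validation (placeholder)
  def lookupReference(r: Request): Request = r             // reference-data lookup (placeholder)
  def calculate(r: Request): Request = r                   // calculations / rules engine (placeholder)
  def sendDownstream(r: Request): Unit =                   // placeholder downstream sink
    println(s"sending ${r.id}")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("RequestPipeline")
      .master("local[*]") // local run; remove when submitting to a cluster
      .getOrCreate()
    val sc = spark.sparkContext

    // Step 1: load a batch of requests as an RDD (here, one request per line of a file).
    val requests = sc.textFile("requests.txt")
      .map(line => Request(line.hashCode.toString, line))

    // Steps 2-4: each function yields a new RDD from the previous one.
    val processed = requests
      .filter(isRelevant)
      .map(validate)
      .map(lookupReference)
      .map(calculate)

    // Step 5: send each processed request downstream.
    // foreachPartition amortises connection setup per partition rather than per record.
    processed.foreachPartition(_.foreach(sendDownstream))

    spark.stop()
  }
}
```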

Does the above sound correct, or would this not be the right way to use Spark?

Thanks

Recommended Answer

We have done something similar on a small IoT project. We tested receiving and processing around 50K MQTT messages per second on 3 nodes, and it was a breeze. Our processing included parsing each JSON message, some manipulation of the resulting object, and saving all the records to a time-series database. We set the batch interval to 1 second; processing time was around 300ms and RAM usage was in the hundreds of KB.

A few concerns with streaming. Make sure your downstream system is asynchronous so you won't run into memory issues. It's true that Spark supports back pressure, but you will need to make it happen. Another thing: try to keep state to a minimum. More specifically, you should not keep any state that grows linearly with your input. This is extremely important for your system's scalability.
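
As a rough illustration of the setup described above: `spark.streaming.backpressure.enabled` is a real Spark configuration key and `Seconds(1)` matches the 1-second batch interval mentioned, while the socket source and the print handler are placeholders standing in for the MQTT ingestion and the database write:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RequestStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("RequestStream")
      .setMaster("local[2]") // local run; remove when submitting to a cluster
      // Let Spark throttle ingestion when batches start to fall behind.
      .set("spark.streaming.backpressure.enabled", "true")

    // 1-second batch interval, as in the setup described above.
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source: the answer used MQTT; substitute the connector
    // for whatever message queue you actually receive requests from.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Keep per-batch state minimal: nothing here accumulates across batches.
      rdd.foreachPartition(_.foreach(msg => println(s"processing $msg")))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that backpressure only bounds the ingestion rate; keeping the downstream write asynchronous, as the answer advises, is still up to your sink implementation.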

What impressed me the most is how easily you can scale with Spark. With each node we added, the frequency of messages we could handle grew linearly.

I hope this helps a little. Good luck.
