Batched API call inside Apache Spark?


Question

I am a beginner to Apache Spark and I have the following task:

I am reading records from a data source that, within the Spark transformations, need to be enriched with data from a call to an external web service before they can be processed any further.

The web service will accept parallel calls to a certain extent, but only allows a few hundred records to be sent at once. It is also quite slow, so batching up as many records as possible and issuing parallel requests would definitely help here.

Is there a way to do this with Spark in a reasonable manner?

I thought of reading the records, pre-processing them into another data source, then reading that "API queue" data source 500 records at a time (with multiple processes if possible), writing the enriched records to the next data source, and using this resulting data source for the final transformations.
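Instead of an intermediate "API queue" data source, the batching can usually be done directly inside a Spark transformation. A minimal sketch of the idea, assuming a hypothetical `call_api` function and the few-hundred-record limit from the question (the function below is exactly what you would pass to `rdd.mapPartitions`, which hands each partition to it as a plain iterator, so it can be demonstrated here without a running cluster):

```python
from itertools import islice

BATCH_SIZE = 500  # the "few hundred records at once" limit from the question

def batched(iterator, size):
    """Yield successive lists of at most `size` items from an iterator."""
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def call_api(batch):
    """Placeholder for the external web service; here it just tags each
    record so the example runs without a real service."""
    return [{"record": r, "enriched": True} for r in batch]

def enrich_partition(iterator):
    """Pass this to rdd.mapPartitions: each executor batches its own
    partition and makes one API call per batch of BATCH_SIZE records."""
    for batch in batched(iterator, BATCH_SIZE):
        for enriched in call_api(batch):
            yield enriched

# In a real job this would be:  enriched = rdd.mapPartitions(enrich_partition)
result = list(enrich_partition(iter(range(1200))))
```

With this pattern, 1200 records in a partition produce three API calls (500 + 500 + 200) instead of 1200 per-record calls, and no intermediate data source is needed.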

The only place where those odd limits need to be respected is within the API calls (which is why I thought some intermediate data format / data source would be appropriate).

Any ideas or directions you want to point me to?

Answer

If you call your external API inside your RDD processing, the calls will be made in parallel by each Spark executor, which, if you think about it, is exactly what you want for fast processing of your data.
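The degree of parallelism across executors is governed by the number of partitions (e.g. via `repartition`), which is one way to stay within what the service tolerates. Within a single executor you can additionally fan out over batches while capping concurrency; a sketch, where `call_api`, `MAX_PARALLEL`, and the sample batches are all assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # hypothetical cap on concurrent calls the service tolerates

def call_api(batch):
    """Placeholder for the slow external web service call."""
    return [r * 2 for r in batch]

batches = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]

# At most MAX_PARALLEL requests are in flight at any moment;
# pool.map preserves the order of the input batches in the results.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(call_api, batches))
```

Keeping the cap explicit on your side means a misconfigured partition count cannot accidentally flood the service.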

If you want to compensate on your side for the sluggishness of the API, you can install a caching server to deal with repeated requests, such as memcached: http://memcached.org/
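The lookup pattern is the same regardless of the cache backend: check the cache, call the service only on a miss, and store the result. A self-contained sketch using a plain dict as a stand-in for a memcached client (in production you would swap the dict for a real client such as pymemcache's `Client`; `slow_api_lookup` is a placeholder for the actual service):

```python
# Stand-in for a memcached client: a plain dict. The cache-aside
# pattern below is unchanged when a real client replaces it.
_cache = {}
api_calls = 0  # counts how often the slow service is actually hit

def slow_api_lookup(key):
    """Placeholder for the slow external web service."""
    global api_calls
    api_calls += 1
    return f"enriched:{key}"

def cached_lookup(key):
    """Cache-aside lookup: only call the service on a cache miss."""
    if key not in _cache:
        _cache[key] = slow_api_lookup(key)
    return _cache[key]

# Five lookups over two distinct keys hit the service only twice.
results = [cached_lookup(k) for k in ["a", "b", "a", "a", "b"]]
```

How much this helps depends entirely on how often the same records recur in your data; for mostly-unique keys a cache buys little.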

