Distributing Python module - Spark vs Process Pools

Question

I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots of pages) and I want to run the text extraction on all of them. Essentially something like this:

fileNameList = ['file1.pdf','file2.pdf',...,'file100000.pdf']

for pdf in fileNameList:
    text = myModule.extractText(pdf) # Distribute this function
    # Do stuff with text

We used Spark once before (a coworker, not me) to distribute indexing a few million files from an SQL DB into Solr across a few servers, however when researching this it seems that Spark is more for parallelizing large data sets, not so much distributing a single task. For that it looks like Python's inbuilt 'Process Pools' module would be better, and I can just run that on a single server with like 4 CPU cores.
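For reference, a minimal single-server sketch of that approach using the standard library's concurrent.futures, assuming myModule.extractText takes a file path and returns a string (as described above):

from concurrent.futures import ProcessPoolExecutor

import myModule

fileNameList = ['file1.pdf', 'file2.pdf']  # ... through file100000.pdf

def process_one(pdf):
    # Runs in a worker process; return whatever you need back in the parent.
    text = myModule.extractText(pdf)
    # Do stuff with text
    return pdf, len(text)

if __name__ == '__main__':
    # Defaults to one worker per CPU core; cap it explicitly if needed.
    with ProcessPoolExecutor(max_workers=4) as pool:
        for pdf, text_length in pool.map(process_one, fileNameList):
            print(pdf, text_length)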

I know SO is more for specific problems, but was just wanting some advice before I go down the entirely wrong road. For my use case should I stick to a single server with Process Pools, or split it across multiple servers with Spark?

Answer

This is perfectly reasonable to use Spark for since you can distribute the task of text extraction across multiple executors by placing the files on distributed storage. This would let you scale out your compute to process the files and write the results back out very efficiently and easily with pySpark. You could even use your existing Python text extraction code:

# Read each PDF from distributed storage as (path, raw bytes) pairs.
pdf_files = sc.binaryFiles("/path/to/files")
# Reuse the existing extraction code on each file's contents.
processed = pdf_files.map(lambda pair: (pair[0], myModule.extract(pair[1])))
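If you then want to persist the results, a sketch along those lines, assuming a SparkSession named spark is available, that myModule.extract returns a string, and using a placeholder output path:

# Persist the (filename, extracted text) pairs; the output path is a placeholder.
df = spark.createDataFrame(processed, ["filename", "text"])
df.write.mode("overwrite").parquet("/path/to/output")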

As your data volume increases or you wish to increase your throughput you can simply add additional nodes.
