Spark - missing 1 required positional argument (lambda function)

Problem Description

I'm trying to distribute some text extraction from PDFs between multiple servers using Spark. This uses a custom Python module I made and is an implementation of this question. The extractTextFromPdf function takes two arguments: a string representing the path to the file, and a configuration file used to determine various extraction constraints. In this case the config file is just a simple YAML file sitting in the same folder as the Python script running the extraction, and the files are just duplicated between the Spark servers.

The main issue I have is being able to call my extract function using the filename as the first argument, rather than the file's content. This is the basic script I have as of now, running it on 2 PDFs in the files folder:

#!/usr/bin/env python3

import ScannedTextExtractor.STE as STE

from pyspark import SparkContext
sc = SparkContext("local", "STE")

input = sc.binaryFiles("/home/ubuntu/files")
processed = input.map(lambda filename, content: (STE.extractTextFromPdf(filename,'ste-config.yaml'), content))

print("Results:")
print(processed.take(2))

This produces the lambda error missing 1 required positional argument: 'content'. I don't really care about using the PDF's raw content, and since the argument to my extraction function is just the path to the PDF, not the actual PDF content itself, I tried giving just one argument to the lambda function, e.g.

processed = input.map(lambda filename: STE.extractTextFromPdf(filename,'ste-config.yaml'))

But then I run into issues because with this setup Spark passes the PDF content (as a byte stream) as that single argument, whereas my module expects a string with the path to the PDF as the first argument, not the whole byte content of the PDF.

I printed the RDD of binary files loaded by the SparkContext and I can see that both the filename and the file content (a byte stream of the PDF) are in the RDD. But how do I use it with my custom Python module, which expects the following syntax:

STE.extractTextFromPDF('/path/to/pdf','/path/to/config-file')

I've tried multiple permutations of the lambda function and I've triple-checked Spark's RDD and SparkContext APIs. I can't seem to get it working.

Recommended Answer

If you only want the path, not the content, then you should not use sc.binaryFiles. (The original error comes from the fact that map passes each RDD element, here a single (path, content) pair, as one argument, so a two-parameter lambda is left missing its second argument.) In that case you should parallelize the paths and then have the Python code load each file individually, like so:

# parallelize the list of paths; each executor then opens its files locally
paths = ['/path/to/file1', '/path/to/file2']
input = sc.parallelize(paths)
# processFile is a placeholder for whatever per-file processing you need
processed = input.map(lambda path: (path, processFile(path)))
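
Applied to the question's module, the same pattern would look roughly like the sketch below. It reuses the SparkContext (sc) and the STE import from the question's script, builds the path list on the driver, and assumes that ste-config.yaml and the PDFs sit at the same locations on every machine, as described in the question:

import os

# list the PDFs on the driver; the same paths must also exist on the executors
pdf_dir = '/home/ubuntu/files'
paths = [os.path.join(pdf_dir, f) for f in os.listdir(pdf_dir) if f.lower().endswith('.pdf')]

pdf_paths = sc.parallelize(paths)
# each task now receives a plain path string, which is what extractTextFromPdf expects
results = pdf_paths.map(lambda path: (path, STE.extractTextFromPdf(path, 'ste-config.yaml')))
print(results.take(2))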

This of course assumes that each executor's Python process can access the files directly; this wouldn't work with HDFS or S3, for instance. Can your library not take binary content directly?
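
If it could, the (path, content) pairs from sc.binaryFiles could be consumed as-is, without each executor needing filesystem access to the PDFs. A minimal sketch of that shape follows; extractTextFromBytes is a hypothetical bytes-accepting entry point, not something the question's module is known to provide:

pdfs = sc.binaryFiles("/home/ubuntu/files")
# each element is a (path, content) pair; content holds the raw bytes of the PDF
# NOTE: extractTextFromBytes is hypothetical - the question's module only documents a path-based API
results = pdfs.map(lambda pair: (pair[0], STE.extractTextFromBytes(pair[1], 'ste-config.yaml')))
print(results.take(2))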
