Using AWS Textract for processing PDF
Question
I want to use the Textract OCR service to read text from a PDF file. My problem is that I want to do it locally, without an S3 bucket. I tested it with image files and it works well, but it does not work for PDF files.
This is the code where I get an error:
response = textract.start_document_text_detection(DocumentLocation="sample2.pdf")
Error:
Invalid type for parameter DocumentLocation, value: sample2.pdf, type: <class 'str'>, valid types: <class 'dict'>
Code 2:
response = textract.start_document_text_detection(DocumentLocation={"name":"sample2.pdf"})
Error:
Unknown parameter in DocumentLocation: "name", must be one of: S3Object
Code 3:
response = textract.start_document_text_detection(Document={'Bytes': "sample2.pdf"})
Error:
Unknown parameter in input: "Document", must be one of: DocumentLocation, ClientRequestToken, JobTag, NotificationChannel, OutputConfig
What should I do? Is there a way to make Textract work for PDF documents without S3?
Recommended answer
The short answer to your question is "No."
The start_document_text_detection call accepts input only from S3. You will need to follow the expected input format, which is described in the boto3 documentation here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection
Essentially, the service wants a structured input, and you need to fill it in correctly according to its specification. Here is the DocumentLocation dictionary input expected by boto3:
DocumentLocation={
    'S3Object': {
        'Bucket': 'string',
        'Name': 'string',
        'Version': 'string'
    }
}
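As a rough illustration of how that dictionary might be used for a local PDF, here is a minimal sketch that first uploads the file to S3, then starts the job and polls for the result. The helper names (build_document_location, extract_pdf_text), the bucket name, and the polling interval are assumptions made for this example and are not part of the original answer; the clients are passed in so the boto3 setup stays outside the function.

```python
import time


def build_document_location(bucket, key):
    """Build the DocumentLocation dict for an object stored in S3."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def extract_pdf_text(s3, textract, pdf_path, bucket, key, poll_seconds=5):
    """Upload a local PDF to S3, run text detection, and return the text lines.

    `s3` and `textract` are expected to be boto3 clients, e.g.
    boto3.client("s3") and boto3.client("textract").
    """
    # The job reads from S3, so the local file has to be uploaded first.
    s3.upload_file(pdf_path, bucket, key)

    job = textract.start_document_text_detection(
        DocumentLocation=build_document_location(bucket, key)
    )
    job_id = job["JobId"]

    # Poll until the asynchronous job leaves the IN_PROGRESS state.
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError("Textract job ended with status " + result["JobStatus"])

    # Results can span several pages of output; follow NextToken to the end.
    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in result.get("Blocks", [])
            if block["BlockType"] == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            break
        result = textract.get_document_text_detection(JobId=job_id, NextToken=token)
    return lines


# Example call (assumes valid AWS credentials and an existing bucket
# named "my-bucket", which is a placeholder):
#
#   import boto3
#   lines = extract_pdf_text(boto3.client("s3"), boto3.client("textract"),
#                            "sample2.pdf", "my-bucket", "sample2.pdf")
#   print("\n".join(lines))
```

Polling is the simplest way to wait for the job; the NotificationChannel parameter listed in the error message above is the alternative if you would rather be notified on completion.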
I'm currently having some similar issues getting this to work in boto3 as well, but I will keep working through the docs to see what I can figure out.