Using AWS Textract for processing PDF

Question

I want to use the Textract OCR service to read text from a PDF file. My problem is that I want to do it locally, without an S3 bucket. I tested it with image files and it works well, but it does not work for PDF files.

This is the code where I get an error:

response = textract.start_document_text_detection(DocumentLocation="sample2.pdf")

Error:

Invalid type for parameter DocumentLocation, value: sample2.pdf, type: <class 'str'>, valid types: <class 'dict'>

Code 2:

response = textract.start_document_text_detection(DocumentLocation={"name":"sample2.pdf"})

Error:

Unknown parameter in DocumentLocation: "name", must be one of: S3Object

Code 3:

response = textract.start_document_text_detection(Document={'Bytes': "sample2.pdf"})

Error:

Unknown parameter in input: "Document", must be one of: DocumentLocation, ClientRequestToken, JobTag, NotificationChannel, OutputConfig

What should I do? Is there a way to make Textract work for PDF documents without S3?

Recommended answer

The short answer to your question is "No."

The start_document_text_detection operation accepts its input from S3 only. You will need to follow the expected input format, which is described for the service in the boto3 documentation here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection

Essentially, the service wants a structured input, and you need to fill it in correctly according to its specification. Here is the DocumentLocation dictionary input expected by boto3:

DocumentLocation={
    'S3Object': {
        'Bucket': 'string',
        'Name': 'string',
        'Version': 'string'
    }
}
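
As a rough illustration of how that dictionary might be filled in for a local PDF, here is a minimal sketch. It assumes the file is first uploaded to an S3 bucket you can write to, and that the result is then fetched by polling get_document_text_detection. The helper names (build_document_location, start_pdf_text_detection, wait_for_lines) are hypothetical and not part of boto3, and the clients are passed in as parameters so the helpers can be tried out without AWS credentials.

```python
import time


def build_document_location(bucket, key, version=None):
    """Build the DocumentLocation dict that start_document_text_detection expects."""
    s3_object = {"Bucket": bucket, "Name": key}
    if version is not None:
        s3_object["Version"] = version  # optional, only meaningful for versioned buckets
    return {"S3Object": s3_object}


def start_pdf_text_detection(s3_client, textract_client, local_path, bucket, key):
    """Upload a local PDF to S3, then start an asynchronous Textract job on it."""
    s3_client.upload_file(local_path, bucket, key)
    response = textract_client.start_document_text_detection(
        DocumentLocation=build_document_location(bucket, key)
    )
    return response["JobId"]


def wait_for_lines(textract_client, job_id, poll_seconds=5):
    """Poll until the job leaves IN_PROGRESS, then collect the text of LINE blocks."""
    while True:
        result = textract_client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    if result["JobStatus"] != "SUCCEEDED":
        # This sketch treats FAILED and PARTIAL_SUCCESS alike as errors.
        raise RuntimeError(f"Textract job ended with status {result['JobStatus']}")

    lines = []
    while True:
        lines += [b["Text"] for b in result["Blocks"] if b["BlockType"] == "LINE"]
        token = result.get("NextToken")
        if not token:
            return lines
        # Results are paginated; follow NextToken until it is absent.
        result = textract_client.get_document_text_detection(
            JobId=job_id, NextToken=token
        )
```

With real clients you would pass boto3.client("s3") and boto3.client("textract"), together with "sample2.pdf" as the local path and your own bucket name and object key. For production use, the NotificationChannel parameter listed in the error message above is probably a better fit than a polling loop.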

I'm currently having some similar issues getting this to work in boto3 as well, but I will keep working through the docs to see what I can figure out.
