How to use Amazon Textract with PDF files


Question

I can already use Textract with JPEG files. I would like to use it with PDF files.

I have the code below:

import boto3

# Document
documentName = "Path to document in JPEG"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract')
documentText = ""

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        documentText = documentText + item["Text"]

        # print('\033[94m' +  item["Text"] + '\033[0m')
        # # print(item["Text"])

# removing the quotation marks from the string, otherwise would cause problems to A.I
documentText = documentText.replace(chr(34), '')
documentText = documentText.replace(chr(39), '')
print(documentText)

As I said, it works fine. But I would like to pass it a PDF file, as the web application allows, for tests.

I know it is possible to convert the PDF to JPEG in Python, but it would be nice to do it with the PDF directly. I read the documentation and did not find the answer.

How can I do this?

EDIT 1: I forgot to mention that I do not intend to use an S3 bucket. I want to pass the PDF directly in the script, without having to upload it to an S3 bucket.

Recommended answer

As @syumaK mentioned, you need to upload the PDF to S3 first. However, doing this may be cheaper and easier than you think:

  • Create a new S3 bucket in the console and note the bucket name, then
import random
import boto3

bucket = 'YOUR_BUCKETNAME'
path = 'THE_PATH_FROM_WHERE_YOU_UPLOAD_INTO_S3'
filename = 'YOUR_FILENAME'

# Upload the PDF; the file name is reused as the S3 object key
s3 = boto3.resource('s3')
print(f'uploading {filename} to s3')
s3.Bucket(bucket).upload_file(path+filename, filename)

# Start the asynchronous text detection job
client = boto3.client('textract')
response = client.start_document_text_detection(
                   DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': filename} },
                   ClientRequestToken=str(random.randint(1, 10**10)))  # the token must be a string
jobid = response['JobId']

# Fetch the result; JobStatus stays IN_PROGRESS until the job has finished
response = client.get_document_text_detection(JobId=jobid)

It may take 5-50 seconds until the call to get_document_text_detection(...) returns a result. Until then, the response reports that the job is still in progress.
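
Because the result is not available immediately, the call usually has to be repeated until the job finishes. The following is a minimal polling sketch, not a definitive implementation. The helper name `wait_for_job` and its `delay`, `timeout`, and `sleep` parameters are hypothetical; it assumes the response carries a `JobStatus` field that reads `IN_PROGRESS` while the job is running.

```python
import time


def wait_for_job(client, job_id, delay=5.0, timeout=300.0, sleep=time.sleep):
    """Poll get_document_text_detection until the job leaves IN_PROGRESS.

    `client` is a boto3 Textract client (or any object with the same method).
    Returns the first response whose status is SUCCEEDED or PARTIAL_SUCCESS.
    """
    waited = 0.0
    while True:
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
                raise RuntimeError(f"Textract job {job_id} ended with status {status}")
            return response
        if waited >= timeout:
            raise TimeoutError(f"Textract job {job_id} still running after {timeout} s")
        # Wait before asking again so the API is not hammered
        sleep(delay)
        waited += delay
```

With the variables from the snippet above, this would be called as `response = wait_for_job(client, jobid)`.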

As I understand it, exactly one paid API call is performed per token; if the token has been used before, the earlier result is retrieved instead.
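
If that understanding is right, a random token never repeats and so never benefits from this reuse. One possible alternative, sketched here under that assumption, is to derive the token from the file's contents so that the same PDF always yields the same token. The helper name `request_token_for` is hypothetical; the sketch also assumes that a 64-character hexadecimal string is an acceptable `ClientRequestToken`, which should be checked against the Textract API reference.

```python
import hashlib


def request_token_for(path):
    """Return a 64-character hex token derived from the file's bytes.

    The same file contents always produce the same token.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result would then be passed as `ClientRequestToken=request_token_for(path + filename)` in place of the random number.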

I forgot to mention one intricacy: if the document is large, the result may need to be stitched together from multiple 'pages'. The kind of code you will need to add is


...
pages = [response]
while nextToken := response.get('NextToken'):
    response = client.get_document_text_detection(JobId=jobid, NextToken=nextToken)
    pages.append(response)
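
Once all the responses are collected, the detected text still has to be pulled out of them. A minimal sketch of that step follows, assuming each response holds a `Blocks` list whose `LINE` blocks carry a `Text` field, as in the JPEG code from the question. The helper name `lines_from_pages` is hypothetical.

```python
def lines_from_pages(pages):
    """Collect the text of every LINE block from a list of Textract responses."""
    lines = []
    for page in pages:
        # A response without blocks contributes nothing
        for block in page.get("Blocks", []):
            if block["BlockType"] == "LINE":
                lines.append(block["Text"])
    return lines
```

For example, `documentText = "\n".join(lines_from_pages(pages))` gives the whole document as a single string with one detected line per row.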
    
