AWS Textract: UnsupportedDocumentException


Question

I hit the exception shown below while calling AWS Textract through boto3 for Python.

Code:

import boto3

# Document
documentName = "/home/niranjan/IdeaProjects/amazon-forecast-samples/notebooks/basic/Tutorial/cert.pdf"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

print(type(imageBytes))

# Amazon Textract client
textract = boto3.client('textract', region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

Below are my AWS credentials and config files:

niranjan@niranjan:~$ cat ~/.aws/credentials
[default]
aws_access_key_id=my_access_key_id
aws_secret_access_key=my_secret_access_key

niranjan@niranjan:~$ cat ~/.aws/config 
[default]
region=eu-west-1

I get this exception:

---------------------------------------------------------------------------
UnsupportedDocumentException              Traceback (most recent call last)
<ipython-input-11-f52c10e3f3db> in <module>
     14 
     15 # Call Amazon Textract
---> 16 response = textract.detect_document_text(Document={'Bytes': imageBytes})
     17 
     18 #print(response)

~/venv/lib/python3.7/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/venv/lib/python3.7/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

UnsupportedDocumentException: An error occurred (UnsupportedDocumentException) when calling the DetectDocumentText operation: Request has unsupported document format

I am a bit new to AWS Textract; any help would be much appreciated.

Recommended answer

The DetectDocumentText API of Textract does not support PDF documents, so sending a PDF raises UnsupportedDocumentException. Try sending an image file (JPEG or PNG) instead.
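One way to catch this before the request is sent is to look at the file's leading bytes. The sketch below is only an illustration: `sniff_format` and `is_sync_image` are hypothetical helpers, not part of boto3 or Textract, and they recognize only the three formats discussed here.

```python
# Hypothetical helpers (not part of boto3): guess a file's format from its
# leading "magic" bytes so a PDF can be caught before it is sent to Textract.
MAGIC_PREFIXES = {
    b"%PDF": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_format(data):
    """Return 'pdf', 'png', 'jpeg', or None if the prefix is not recognized."""
    head = bytes(data[:8])  # also works when data is a bytearray
    for prefix, name in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return name
    return None


def is_sync_image(data):
    """True only for the JPEG/PNG inputs the answer says DetectDocumentText accepts."""
    return sniff_format(data) in ("jpeg", "png")
```

With the question's code, calling `is_sync_image(imageBytes)` before `detect_document_text` would return `False` if `cert.pdf` is a genuine PDF, which makes the cause of the exception visible locally.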

If you still want to send a PDF file, you have to use Textract's asynchronous APIs, e.g. StartDocumentAnalysis to start the analysis and GetDocumentAnalysis to retrieve the analyzed document.
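A minimal sketch of that asynchronous flow with boto3 follows. It assumes the PDF has already been uploaded to S3, since the asynchronous operations read the document from an S3 location rather than from inline bytes. The function name, the polling parameters, and the bucket and key in the usage comment are placeholders chosen for illustration.

```python
import time


def analyze_pdf_from_s3(textract, bucket, key, poll_seconds=5, max_polls=60):
    """Start an asynchronous analysis job for an S3 object and collect its blocks.

    `textract` is a boto3 Textract client; `bucket` and `key` are placeholders
    for wherever the PDF was uploaded.
    """
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state, or give up.
    for _ in range(max_polls):
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # This sketch treats anything other than SUCCEEDED as a failure.
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended with {result['JobStatus']}")

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    while result.get("NextToken"):
        result = textract.get_document_analysis(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks


# Usage (bucket and key are hypothetical):
#   import boto3
#   client = boto3.client("textract", region_name="us-west-2")
#   blocks = analyze_pdf_from_s3(client, "my-bucket", "cert.pdf")
```

If only plain text is needed, the `start_document_text_detection` and `get_document_text_detection` pair follows the same start-then-poll pattern without the `FeatureTypes` argument. Polling in a loop keeps the sketch short; Textract can also publish job completion to an SNS topic, which avoids repeated status calls.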

The API documentation for DetectDocumentText says:

Detects text in the input document. Amazon Textract can detect lines of text and the words that make up a line of text. The input document must be an image in JPEG or PNG format. DetectDocumentText returns the detected text in an array of Block objects.

https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html
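Once a supported image is sent, the detected text can be read back from the returned Block objects. The helper below is a small sketch; `extract_lines` is a hypothetical name, and it relies only on the `Blocks`, `BlockType`, and `Text` fields of the response.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Usage with the response from the question's detect_document_text call:
#   for line in extract_lines(response):
#       print(line)
```

Filtering on `LINE` avoids printing each word twice, because the response lists both the lines and the individual `WORD` blocks that make them up.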
