Form Recognizer speed issues

Question

I'm using a custom model with labels (created with the sample labeling tool) and getting the results with the "Python Form Recognizer Async Analyze" V2 SDK code from the bottom of this page. It basically works, but it took over 20 seconds to get the results for a single-page PDF file (6 labels used, S0 pricing model). 150 single-page PDF files took over one hour. We also tested with the V1 SDK Preview version (without labels) of Form Recognizer, which was significantly faster than V2.

I know V2 is async now, but is there anything that could be done to speed up form recognition? Below is the code I'm basically using:

########### Python Form Recognizer Async Analyze #############
import json
import time
from requests import get, post

# Endpoint URL
endpoint = r"<endpoint>"
apim_key = "<subscription key>"
model_id = "<model_id>"
post_url = endpoint + "/formrecognizer/v2.0-preview/custom/models/%s/analyze" % model_id
source = r"<file path>"
params = {
    "includeTextDetails": True
}

headers = {
    # Request headers
    'Content-Type': '<file type>',
    'Ocp-Apim-Subscription-Key': apim_key,
}
with open(source, "rb") as f:
    data_bytes = f.read()

try:
    resp = post(url = post_url, data = data_bytes, headers = headers, params = params)
    if resp.status_code != 202:
        print("POST analyze failed:\n%s" % json.dumps(resp.json()))
        quit()
    print("POST analyze succeeded:\n%s" % resp.headers)
    get_url = resp.headers["operation-location"]
except Exception as e:
    print("POST analyze failed:\n%s" % str(e))
    quit() 

n_tries = 15
n_try = 0
wait_sec = 5
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = {"Ocp-Apim-Subscription-Key": apim_key})
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET analyze results failed:\n%s" % json.dumps(resp_json))
            quit()
        status = resp_json["status"]
        if status == "succeeded":
            print("Analysis succeeded:\n%s" % json.dumps(resp_json))
            quit()
        if status == "failed":
            print("Analysis failed:\n%s" % json.dumps(resp_json))
            quit()
        # Analysis still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)     
    except Exception as e:
        msg = "GET analyze results failed:\n%s" % str(e)
        print(msg)
        quit()
print("Analyze operation did not complete within the allocated time.")

Recommended answer

Thanks for the question, we are investigating this issue and will update you shortly. For analyzing 150 single pages you can send all the pages in parallel to Form Recognizer to reduce the time.
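
The parallel approach suggested here could look roughly like the sketch below. It is not taken from the answer: `analyze_in_parallel` and its `analyze_one` argument are hypothetical names, where `analyze_one(path)` stands for a function that wraps the POST-and-poll logic from the question and returns the result JSON for one file. The `max_workers` value is a guess to tune, since the service may throttle requests if too many are sent at once.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyze_in_parallel(paths, analyze_one, max_workers=8):
    """Run analyze_one(path) for every path on a thread pool.

    analyze_one is a hypothetical callable (not part of any SDK) that
    submits one file and polls until its analysis finishes.
    Returns two dicts keyed by path: results and errors.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every file up front so the waits overlap instead of adding up.
        futures = {pool.submit(analyze_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # Keep going if one file fails; report it separately.
                errors[path] = exc
    return results, errors
```

Threads suit this case because each call spends most of its time waiting on the network. Note that the question's script calls `quit()` on failure, so a wrapper used as `analyze_one` would need to return the JSON or raise an exception instead. Its polling delay also starts at 5 seconds and doubles each time, so shortening that delay inside the wrapper may reduce the per-file latency as well.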
