Improving accuracy in Python Tesseract OCR


Problem description

I am using pytesseract with OpenCV in a simple Django application to extract Bengali text from image files. A form lets the user upload an image; clicking the submit button sends it to the server in a jQuery AJAX call, and the server runs OCR (Optical Character Recognition) on it to extract the text.

Template part:

 <div style="text-align: center;">
 <div id="result" class="text-center"></div>
    <form enctype="multipart/form-data" id="ocrForm" action="{% url 'process_image' %}" method="post"> <!-- Do not forget to add: enctype="multipart/form-data" -->
        {% csrf_token %}
        {{ form }}
        <button type="submit" class="btn btn-success">OCRzed</button>
    </form>

    <br><br><hr>
    <div id="content" style="width: 50%; margin: 0 auto;">
        
    </div>
</div>


<script type="text/javascript">
    $(document).ready(function(){
        function submitFile(){
            var fd = new FormData();
            fd.append('file', getFile());
            $("#result").html('<span class="wait">Please wait....</span>');
            $('#content').html('');

            $.ajax({
                url: "{% url 'process_image' %}",
                type: "POST",
                data: fd,
                processData: false,
                contentType: false,
                success: function(data){
                    // console.log(data.content);
                    $("#result").html('');

                    if(data.content){
                        $('#content').html(
                            "<p>" + data.content + "</p>"
                        );
                    }
                }
            });
        }

        function getFile(){
            var fp = $("#file_id");
            var item = fp[0].files;
            return item[0];
        }

        // Submit the file for OCR
        $("#ocrForm").on('submit', function(event){
            event.preventDefault();
            submitFile();
        });
    });
</script>

The urls.py file has:

from django.urls import path, re_path
from .views import *

urlpatterns = [
    path('process_image', OcrView.process_image, name='process_image'),
]

The view part:

from django.contrib.auth.models import User
from django.shortcuts import render, redirect, get_object_or_404
from django.http import HttpResponse, Http404, JsonResponse
from django.views.generic import FormView
from django.views.decorators.csrf import csrf_exempt

# UploadForm is used by OcrView below but was not imported in the original;
# it is assumed here to be defined in forms.py next to NewTopicForm.
from .forms import NewTopicForm, UploadForm
from .models import Board, Topic, Post

import json
import cv2
import numpy as np
import pytesseract

try:
    from PIL import Image
except ImportError:
    import Image  # fallback for the legacy standalone PIL package

def ocr(request):
    return render(request, 'ocr.html')
    #    {'board': board,'form':form})    

# get grayscale image
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# noise removal
def remove_noise(image):
    return cv2.medianBlur(image, 5)

# thresholding (Otsu picks the threshold value automatically)
def thresholding(image):
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# dilation
def dilate(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.dilate(image, kernel, iterations=1)

# erosion
def erode(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.erode(image, kernel, iterations=1)

# opening - erosion followed by dilation
def opening(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

# canny edge detection
def canny(image):
    return cv2.Canny(image, 100, 200)

# skew correction
def deskew(image):
    coords = np.column_stack(np.where(image > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
    return rotated

# template matching
def match_template(image, template):
    return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)

class OcrView(FormView):
    form_class = UploadForm
    template_name = 'ocr.html'
    success_url = '/'

    # Routed as OcrView.process_image in urls.py, so it is called as a plain
    # function and takes no `self`.
    @csrf_exempt
    def process_image(request):
        # Only POST requests carry an uploaded file
        if request.method != 'POST':
            return JsonResponse({'error': 'POST required'}, status=405)

        response_data = {}
        upload = request.FILES['file']

        # Convert the uploaded bytes to a numpy array and decode it as an image
        filestr = upload.read()
        npimg = np.frombuffer(filestr, np.uint8)  # np.fromstring is deprecated
        image = cv2.imdecode(npimg, cv2.IMREAD_UNCHANGED)

        # image = Image.open(upload)
        gray = get_grayscale(image)
        thresh = thresholding(gray)
        opening1 = opening(gray)
        canny1 = canny(gray)

        pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
        # content = pytesseract.image_to_string(Image.open(upload), lang='ben')

        # content = pytesseract.image_to_string(image, lang='ben')

        content = pytesseract.image_to_string(image, lang='eng+ben')

        # data_ben = process_image("test_ben.png", "ben")
        response_data['content'] = content

        return JsonResponse(response_data)

I am attaching a sample image just below. When I give it as the input file, the extracted text is nowhere near a satisfactory level of accuracy. The input image is:

I am also attaching a screenshot of the extracted text, with the wrong words underlined in red. Note that spaces and indentation are not preserved there. The screenshot of the extracted text is:

In the above code snippet, I do the image processing with the following lines:

gray = get_grayscale(image)
thresh = thresholding(gray)
opening1 = opening(gray)
canny1 = canny(gray)

After that, I feed Tesseract the processed image in the following line:

content = pytesseract.image_to_string( image, lang = 'eng+ben')

But my point of confusion is that I never save the image, either before or after processing. So when I use the line above, I am not sure whether the processed or the unprocessed image is supplied to the Tesseract engine.

Q1) Do I need to save the image after processing it and then supply it to the Tesseract engine? If yes, how do I do that?

Q2) What other steps should I take to improve the accuracy?

NB: Even if you are not familiar with Bengali, I think this won't be a problem, as you can just look at the red-underlined words and compare.

EDIT:

TL;DR: You can just look at the code in the views.py and urls.py files and skip the template code, to make it easier to follow.

Solution

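Q1) No need to save the image. The decoded image is already held in memory in your variable image, and each processing function returns its result as a new array (gray, thresh, opening1, canny1). pytesseract.image_to_string accepts such an array directly.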

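Q2) You are not actually running OCR on the image that the processing functions were applied to, i.e. the variable canny1; you pass the original image instead, and each function is applied to gray rather than to the output of the previous step. The code below performs the processing steps in sequence, each on the result of the one before, and then applies OCR to the processed image stored in canny1.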

gray = get_grayscale(image)
thresh = thresholding(gray)
opening1 = opening(thresh)
canny1 = canny(opening1)

content = pytesseract.image_to_string(canny1, lang='eng+ben')
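
The chaining in the corrected code can also be written as a small helper, so that a step cannot silently fall back to an earlier input. This is only a sketch: run_pipeline is a hypothetical name, not part of OpenCV or pytesseract, and the commented usage assumes the helper functions from the question.

```python
def run_pipeline(image, steps):
    """Feed `image` through `steps` in order; each step gets the previous step's output."""
    result = image
    for step in steps:
        result = step(result)
    return result


# Tiny self-check with plain numbers standing in for images:
# (3 + 1) * 2 == 8, which shows the second step saw the first step's output.
demo = run_pipeline(3, [lambda x: x + 1, lambda x: x * 2])

# Hypothetical usage with the helpers from the question (not run here):
# canny1 = run_pipeline(image, [get_grayscale, thresholding, opening, canny])
# content = pytesseract.image_to_string(canny1, lang='eng+ben')
```

Because the step list is plain data, it is easy to try variants, for example OCR on the output of [get_grayscale, thresholding] alone, and compare the extracted text. Which steps help depends on the input image.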
