Pytesseract Improve OCR Accuracy


Problem description

I want to extract the text from an image in Python, and I chose pytesseract to do it. When I tried extracting the text from the image, the results weren't satisfactory. I also went through this and implemented all the techniques listed there. Yet it still doesn't perform well.

Image:

Code:

import pytesseract
import cv2
import numpy as np

img = cv2.imread('D:\\wordsimg.png')

# Upscale slightly, then convert to grayscale
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Dilate, then erode, with a 1x1 kernel
kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

# Median blur followed by Otsu thresholding
img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'

txt = pytesseract.image_to_string(img, lang='eng')

# Drop the last character and put everything on one line
txt = txt[:-1]
txt = txt.replace('\n', ' ')

print(txt)

Output:

t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was 

Even 1 unwanted space could cost me a lot. I want the results to be 100% accurate. Any help would be appreciated. Thanks!

Solution

I changed the resize factor from 1.2 to 2 and removed all preprocessing. I got good results with psm 11 and psm 12.

import pytesseract
import cv2
import numpy as np

img = cv2.imread('wavy.png')

# Upscale by 2x instead of 1.2x
#  img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.resize(img, None, fx=2, fy=2)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# The dilate/erode and blur/threshold steps from the question are disabled
kernel = np.ones((1,1), np.uint8)
#  img = cv2.dilate(img, kernel, iterations=1)
#  img = cv2.erode(img, kernel, iterations=1)

#  img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Save the image that is passed to Tesseract so it can be inspected
cv2.imwrite('thresh.png', img)

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'

# Try page segmentation modes 6 through 13 and print each result
for psm in range(6,13+1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config = config, lang='eng')
    print('psm ', psm, ':',txt)

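The config = '--oem 3 --psm %d' % psm line uses the string interpolation (%) operator to replace %d with an integer (psm). The --oem option sets the OCR engine mode; 3 is the default, which picks the engine based on what is available (tesseract --help-oem lists the modes). I've gotten in the habit of passing it explicitly. More on psm at the end of this answer.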
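The same config strings can also be built with an f-string. Here is a small sketch that builds one config per mode for the range used in the loop above. build_config is a hypothetical helper name, and it only uses the --oem and --psm flags that already appear in this answer:

```python
def build_config(psm: int, oem: int = 3) -> str:
    """Return a Tesseract config string such as '--oem 3 --psm 11'."""
    # Modes 0..13 are the ones listed by `tesseract --help-psm`.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


# One config string per page segmentation mode tried in the loop (6..13).
configs = {psm: build_config(psm) for psm in range(6, 13 + 1)}

if __name__ == "__main__":
    for psm, config in configs.items():
        print(psm, "->", config)
```

Each value can be passed as the config argument of pytesseract.image_to_string, exactly like the %-formatted string.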

psm  11 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was

psm  12 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was
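Since the goal is exact output, it may help to score each mode against a hand-typed reference instead of comparing by eye. Below is a rough sketch using only the standard library. The reference string is an assumption (typed from the first line of the psm 11 output above, not from the original image), the "default" entry is shortened from the question's output, and normalize and score_ocr are hypothetical helper names:

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse every run of whitespace (spaces, newlines) into one space."""
    # str.split() with no arguments also drops leading/trailing whitespace.
    return " ".join(text.split())


def score_ocr(candidate: str, reference: str) -> float:
    """Similarity in [0, 1] after whitespace normalization; 1.0 means identical."""
    return SequenceMatcher(None, normalize(candidate), normalize(reference)).ratio()


# Assumed ground truth for the first line of the image (illustrative only).
reference = "those he large form might light another us should name"

# Illustrative OCR outputs keyed by the setting that produced them.
results = {
    "default": "t hose he large form might light another us should",
    "psm 11": "those he large form might light another us should name\n\n",
}

best = max(results, key=lambda name: score_ocr(results[name], reference))

if __name__ == "__main__":
    for name, text in results.items():
        print(name, round(score_ocr(text, reference), 3))
    print("best:", best)
```

Normalizing first means the blank lines that the sparse-text modes put between lines do not count against them.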

psm is short for page segmentation mode. I haven't tried every mode, but the descriptions give a feel for what each one does. You can get the list by running tesseract --help-psm:

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
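To label results by mode description instead of by number, the listing can be parsed into a dictionary. This is a minimal sketch that assumes the text has the layout shown above (a mode number, spaces, a description, and indented continuation lines). parse_psm_help is a hypothetical helper, and the sample text is an excerpt of the listing rather than live output:

```python
import re
from typing import Dict

# Excerpt of the listing above, used as sample input.
PSM_HELP = """\
Page segmentation modes:
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
"""


def parse_psm_help(help_text: str) -> Dict[int, str]:
    """Map each mode number to its description, joining wrapped lines."""
    modes: Dict[int, str] = {}
    last = None
    for line in help_text.splitlines():
        match = re.match(r"\s*(\d+)\s+(.*)", line)
        if match:
            last = int(match.group(1))
            modes[last] = match.group(2).strip()
        elif last is not None and line.strip():
            # A non-empty line without a leading number continues the previous mode.
            modes[last] += " " + line.strip()
    return modes


if __name__ == "__main__":
    for number, description in parse_psm_help(PSM_HELP).items():
        print(number, "-", description)
```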
