UnicodeDecodeError with Tesseract OCR in Python

Question

I am trying to extract text from an image file using Tesseract OCR in Python, but I am getting an error that I can't figure out how to deal with. My environment is set up correctly, since I have already tested some sample images with the OCR in Python.

This is the code:

from PIL import Image
import pytesseract
strs = pytesseract.image_to_string(Image.open('binarized_image.png'))

print(strs)

The following is the error I get from the Eclipse console:

strs = pytesseract.image_to_string(Image.open('binarized_body.png'))
  File "C:\Python35x64\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
    return f.read().strip()
  File "C:\Python35x64\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 20: character maps to <undefined>

I am on Windows 10.

Recommended answer

The problem is that Python is using the Windows default encoding (CP1252) instead of the one it's meant to use (UTF-8). As the traceback shows, pytesseract reads Tesseract's output file in text mode, so Python tries to decode it as CP1252. Tesseract writes UTF-8, and the output contains a byte (0x9d) that has no mapping in CP1252, so decoding fails. On a platform where the default encoding is UTF-8 you won't encounter this error.
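The failure can be reproduced without Tesseract. The sketch below assumes, purely for illustration, that the recognized character was a right double quotation mark (U+201D), whose UTF-8 encoding ends in the byte 0x9d; the traceback does not say which character was actually involved.

```python
# UTF-8 bytes for the right double quotation mark (U+201D): e2 80 9d.
# The choice of character is an assumption made for illustration.
raw = "\u201d".encode("utf-8")

try:
    raw.decode("cp1252")          # what the Windows default encoding attempts
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)      # mentions byte 0x9d, as in the traceback

decoded = raw.decode("utf-8")     # decoding with the matching codec succeeds

print(error_message)
print(decoded == "\u201d")
```

The byte 0x9d is one of the few values that CP1252 leaves undefined, which is why the same bytes decode fine as UTF-8 but not with the Windows default.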

You can try using a different function (possibly one that returns bytes instead of str, so you won't have to worry about encoding). You could change Python's default encoding, as mentioned in one of the comments, although that will cause problems when you try to print the string on the Windows console. Or, and this is my recommended solution, you could download Cygwin and run Python on that to get clean UTF-8 output.
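One way to get bytes rather than str is to bypass pytesseract and call the tesseract executable directly, then decode its output yourself. This is only a sketch: the helper names `decode_ocr_output` and `ocr_image` are hypothetical, and it assumes that `tesseract` is on the PATH and that the installed version accepts `stdout` as the output base.

```python
import subprocess


def decode_ocr_output(raw: bytes) -> str:
    """Decode Tesseract's UTF-8 output, replacing any malformed bytes."""
    return raw.decode("utf-8", errors="replace").strip()


def ocr_image(path: str, tesseract_cmd: str = "tesseract") -> str:
    """Run the tesseract CLI on `path` and return the recognized text.

    Passing "stdout" as the output base asks tesseract to write the text
    to standard output instead of a file (assumed to be supported by the
    installed version).
    """
    result = subprocess.run(
        [tesseract_cmd, path, "stdout"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return decode_ocr_output(result.stdout)
```

With this in place, `ocr_image('binarized_image.png')` would return a str that was decoded as UTF-8 regardless of the Windows default encoding.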

If you want a quick and dirty solution that won't break anything (yet), here's a way that you might consider:

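import builtins

from PIL import Image
import pytesseract

original_open = open

def bin_open(filename, mode='rb'):       # note: the default mode now opens in binary
    return original_open(filename, mode)

img = Image.open('binarized_image.png')

try:
    # Temporarily patch open() so pytesseract reads its output file as bytes.
    builtins.open = bin_open
    bts = pytesseract.image_to_string(img)
finally:
    # Always restore the real open(), even if the OCR call fails.
    builtins.open = original_open

# 'ignore' drops bytes that CP1252 cannot decode, so some characters may be lost.
print(str(bts, 'cp1252', 'ignore'))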
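Decoding with 'ignore' silently drops bytes. If the text is decoded as UTF-8 instead, printing it can still fail with UnicodeEncodeError on a console that uses a narrower encoding such as CP1252. A possible workaround, sketched here with a hypothetical helper named `console_safe`, is to replace the characters the console cannot represent before printing:

```python
import sys


def console_safe(text: str, encoding: str) -> str:
    """Return `text` with characters that `encoding` cannot represent replaced by '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


# sys.stdout.encoding can be None when output is redirected; fall back to UTF-8.
target = sys.stdout.encoding or "utf-8"
print(console_safe("OCR result: \u201cquoted\u201d \u2713", target))
```

The round trip through `encode` and `decode` keeps every character the console can display and loses only the ones it cannot, which makes the loss visible as `?` instead of raising an exception.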
