UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python

Question

I am trying to run OCR on an image file in Python using Tesseract OCR (via pytesseract). My environment is Python 3.5 (Anaconda) on a Windows machine.

Here is the code:

from PIL import Image
from pytesseract import image_to_string
out = image_to_string(Image.open('sample.png'))

The error I get is:

File "Anaconda3\lib\sitepackages\pytesseract\pytesseract.py", line 167, in image_to_string
return f.read().strip()
File "Anaconda3\lib\encodings\cp1252.py", line 23 in decode
return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError:'charmap' codec can't decode byte 0x81 in position 1583: character maps to <undefined>

I have tried the solution mentioned here, but the hack does not work.

I have also tried my code on Mac OS, and it works there.

I have looked into the pytesseract issues: there is an open issue for this.

Thanks.

Recommended answer

Hmm, something very odd is going on there. In the "latin1" text encoding, the byte "\x81" decodes to an unprintable control character. In the "cp1252" encoding the library is using, however, that byte is explicitly left undefined, so decoding it raises the error you see.

What happens is that "latin1" is something of a "no-op" codec: it maps every byte to the Unicode code point with the same value, so Python code sometimes uses it simply to turn a byte sequence into a Unicode string (the default string type in Python 3.x). The "cp1252" codec is almost the same thing, and in some contexts it is used interchangeably with latin1, but the "\x81" byte is one of the differences between the two. In your case, it is the crucial one.
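The difference between the two codecs can be checked directly in an interpreter. The sketch below uses only Python's built-in codecs; the variable name `undefined` is just illustrative. It decodes every single-byte value with both codecs and collects the bytes that cp1252 refuses to decode.

```python
# Decode each possible byte value with both codecs.
undefined = []  # byte values that cp1252 cannot decode
for value in range(256):
    raw = bytes([value])
    # latin1 never fails: byte N always becomes code point N.
    assert raw.decode("latin1") == chr(value)
    try:
        raw.decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])
```

The byte 0x81 from the traceback should appear in the printed list, alongside the few other positions that cp1252 leaves unassigned.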

The correct thing to do is to try passing the optional lang parameter to the image_to_string function, so that it may use the correct codec to decode your text, provided it then recognizes more accurately the character it currently outputs as "0x81". This might not work, though: the byte may simply be an OCR error that produced a very odd character unrelated to the language.

So the workaround is to monkeypatch the "cp1252" codec so that, instead of raising an error, it fills in the Unicode "unrecognized" (replacement) character. One way to do that is to insert these lines before calling Tesseract:

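from encodings import cp1252

# Keep a reference to the original decoder, then wrap it so that the
# default error handler is "replace" instead of "strict".
original_decode = cp1252.Codec.decode
cp1252.Codec.decode = lambda self, input, errors="replace": original_decode(self, input, errors)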
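To see what the "replace" error handler does on its own, independent of the monkeypatch, here is a small sketch with a made-up byte string (the sample bytes are an assumption for illustration, not real Tesseract output):

```python
# Hypothetical sample: ASCII text with the problematic byte 0x81 in it.
sample = b"OCR \x81 text"

# With the default "strict" handler, cp1252 raises UnicodeDecodeError.
try:
    sample.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With "replace", the undecodable byte becomes U+FFFD,
# the Unicode replacement character.
replaced = sample.decode("cp1252", errors="replace")
print(strict_failed, repr(replaced))
```

The rest of the text survives unchanged; only the undecodable byte is swapped for the placeholder, so the original character is lost.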

But please, if you can, open a bug report against the pytesseract project. My guess is that they should be using the "latin1" encoding rather than "cp1252" at this point.
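As a rough illustration of that guess, the sketch below writes a few bytes to a temporary file and reads them back with an explicitly chosen encoding. It is not pytesseract's code; the file contents and the helper name `read_text` are assumptions made for the example.

```python
import os
import tempfile

def read_text(path, encoding):
    """Read a whole file as text using an explicitly chosen encoding."""
    with open(path, encoding=encoding) as f:
        return f.read().strip()

# Hypothetical stand-in for an OCR output file containing byte 0x81.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"result \x81\n")

try:
    # latin1 accepts every byte value, so this read cannot fail.
    latin1_text = read_text(path, "latin1")
    try:
        read_text(path, "cp1252")
        cp1252_ok = True
    except UnicodeDecodeError:
        cp1252_ok = False
finally:
    os.remove(path)

print(repr(latin1_text), cp1252_ok)
```

Reading as latin1 avoids the exception, but whether it yields the right characters depends on the encoding Tesseract actually wrote, which this sketch does not determine.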
