UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python
Question
I am trying to run OCR on an image file in Python using Tesseract OCR. My environment is Python 3.5 (Anaconda) on a Windows machine.
Here is the code:
from PIL import Image
from pytesseract import image_to_string
out = image_to_string(Image.open('sample.png'))
The error I get is:
  File "Anaconda3\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
    return f.read().strip()
  File "Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1583: character maps to <undefined>
I have tried the solution mentioned here, but the hack is not working.
I have also tried my code on Mac OS, where it works.
I have looked into the pytesseract issues: there is an open issue for this.
Thanks
Recommended answer
Hmm, something very weird is going on there. The character "\x81" is unprintable in the "latin1" text encoding. In the "cp1252" encoding the library is using, however, that byte is explicitly mapped to an "undefined character".
What happens is that "latin1" is more or less a "no-op" codec, sometimes used in Python simply to turn a byte sequence into a Unicode string (the default string type in Python 3.x). The "cp1252" codec is almost the same thing, and in some contexts it is used interchangeably with latin1, but the "\x81" byte is one of the differences between the two. In your case, it is a crucial one.
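The difference between the two codecs can be seen without Tesseract at all. The following is a minimal sketch using only built-in bytes decoding; the sample bytes are made up for illustration and are not real OCR output.

```python
# Hypothetical bytes standing in for OCR output that contains the problem byte.
raw = b"caf\x81"

# latin1 maps every byte 0x00-0xFF to the code point with the same value,
# so decoding can never fail.
as_latin1 = raw.decode("latin1")

# cp1252 leaves 0x81 undefined, so strict decoding raises UnicodeDecodeError.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the undecodable byte becomes U+FFFD instead.
as_cp1252_replaced = raw.decode("cp1252", errors="replace")

print(repr(as_latin1), strict_failed, repr(as_cp1252_replaced))
```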
The correct thing to do is to try passing the optional lang parameter to the image_to_string function, so that Tesseract has a better chance of recognizing the character it is currently emitting as "0x81" and of producing text that decodes cleanly. However, this might not work, since the byte may simply be an OCR error that yields a very weird character unrelated to the language at all.
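Passing lang might look like the sketch below. The helper name ocr_with_language and the default "eng" are assumptions made for illustration; the language code has to match a trained-data file installed with your Tesseract, and the imports sit inside the function only so the sketch can be loaded on a machine without the OCR stack.

```python
def ocr_with_language(image_path, lang="eng"):
    """Run OCR on image_path, telling Tesseract which language to expect.

    Hypothetical helper: "eng" is only an example language code.
    """
    # Imported lazily so defining the function does not require
    # Pillow or pytesseract to be installed.
    from PIL import Image
    from pytesseract import image_to_string

    return image_to_string(Image.open(image_path), lang=lang)
```

It would be called as ocr_with_language('sample.png'), or with another language code if the image is not in English.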
So the workaround for you is to monkeypatch the "cp1252" codec so that, instead of raising an error, it fills in the Unicode "unrecognized" (replacement) character. One way to do that is to insert these lines before calling tesseract:
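from encodings import cp1252
# Keep the original decoder, then make errors="replace" the default so that
# undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
original_decode = cp1252.Codec.decode
cp1252.Codec.decode = lambda self, input, errors="replace": original_decode(self, input, errors)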
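The effect the patch aims for can also be had without touching the codec globally, assuming you are in a position to read the text file yourself. The sketch below is not pytesseract's code: read_text_lenient is a hypothetical helper, and the temporary file only stands in for a text file containing the offending byte.

```python
import os
import tempfile


def read_text_lenient(path, encoding="cp1252"):
    """Read a text file, replacing undecodable bytes with U+FFFD."""
    with open(path, encoding=encoding, errors="replace") as f:
        return f.read().strip()


# Demo: write a file containing byte 0x81, which cp1252 cannot decode strictly.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"scanned \x81 text\n")

try:
    demo_text = read_text_lenient(demo_path)
finally:
    os.remove(demo_path)

print(repr(demo_text))
```

Passing errors="replace" to open() limits the lenient behaviour to this one file, whereas the monkeypatch changes the codec for the whole process.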
But please, if you can, open a bug report against the pytesseract project. My guess is that they should be using "latin1" rather than "cp1252" encoding at this point.