'utf-8' codec can't decode byte 0xe2: invalid continuation byte error


Problem description

I am trying to read all PDF files from a folder to look for a number using a regular expression. On inspection, the charset for the PDFs is 'UTF-8'.

It throws this error:

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

I tried reading in binary mode and tried Latin-1 encoding, but it shows all special characters, so nothing shows up in the search.

import os
import re
import pandas as pd

download_file_path = "C:\\Users\\...\\..\\"
for file_name in os.listdir(download_file_path):
    try:
        with open(download_file_path + file_name, 'r', encoding="UTF-8") as f:
            s = f.read()
            # Raw string so the backslash escapes reach the regex engine intact
            re_api = re.compile(r"API No\.:\n(.*)")
            api = re_api.search(s).group(1).split('"')[0].strip()
            print(api)
    except Exception as e:
        print(e)

Expecting to find the API number in the PDF files.

Recommended answer

When you open a file with open(..., 'r', encoding='utf-8'), you are basically guaranteeing that this is a text file containing no bytes which are not UTF-8. But of course, this guarantee cannot hold for a PDF file: it is a binary format which may or may not contain strings encoded in UTF-8. Either way, that's not how you read it.

If you have access to a library which reads PDF and extracts text strings, you could do

# Dunno if such a library exists, but bear with ...
instance = myFantasyPDFlibrary('file.pdf')
for text_snippet in instance.enumerate_texts_in_PDF():
    if 'API No.:\n' in text_snippet:
        api = text_snippet.split('API No.:\n')[1].split('\n')[0].split('"')[0].strip()
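
In practice, a real text-extraction library can fill that role. The sketch below assumes pdfminer.six is installed (my example, not something the original answer names); its extract_text helper returns the PDF's text as a plain string:

from pdfminer.high_level import extract_text

# Pull all text out of the PDF, then search it the same way as above.
text = extract_text('file.pdf')
if 'API No.:\n' in text:
    api = text.split('API No.:\n')[1].split('\n')[0].split('"')[0].strip()
    print(api)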

More realistically, but in a more pedestrian fashion, you could read the PDF file as a binary file and look for the encoded text.

with open('file.pdf', 'rb') as pdf:
    pdfbytes = pdf.read()
if b'API No.:\n' in pdfbytes:
    api_text = pdfbytes.split(b'API No.:\n')[1].split(b'\n')[0].decode('utf-8')
    api = api_text.split('"')[0].strip()

A crude workaround is to lie to Python about the encoding, and claim that it's actually Latin-1. This particular encoding has the attractive feature that every byte maps exactly to its own Unicode code point, so you can read binary data as text and get away with it. But then, of course, any actual UTF-8 will be converted to mojibake (so "hëlló" will render as "hÃ«llÃ³", for example). You can extract actual UTF-8 text by converting the text back to bytes and then decoding it with the correct encoding (latintext.encode('latin-1').decode('utf-8')).
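
A minimal sketch of that workaround, reusing the file name and marker string from the examples above, and assuming the snippet between the markers is valid UTF-8:

# Latin-1 maps every byte to a code point, so this read never raises a decode error.
with open('file.pdf', 'r', encoding='latin-1') as f:
    latintext = f.read()

if 'API No.:\n' in latintext:
    snippet = latintext.split('API No.:\n')[1].split('\n')[0]
    # Undo the Latin-1 "lie" to recover the real UTF-8 text.
    api = snippet.encode('latin-1').decode('utf-8').split('"')[0].strip()
    print(api)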
