How to read multiple known file encodings


Question

I've been searching the web for a way to read files with different encodings, and I've found many instances of "it's impossible to tell what encoding a file is" (so if anyone reading this has a link, I would appreciate it). However, the problem I was dealing with was more focused than "open a file in any encoding": I only needed to open files that use one of a known set of encodings. I am by no means an expert on this topic, but I thought I would post my solution in case anyone else runs into this issue.

Specific example:

Known file encodings: UTF-8 and Windows ANSI

Initial issue: as I now know, calling Python's open('file', 'r') without an encoding argument falls back to a platform-dependent default, which in my case was encoding='utf8'. That raised a UnicodeDecodeError at runtime when trying to f.readline() an ANSI file. A common search for this is: "UnicodeDecodeError: 'utf-8' codec can't decode byte"
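The default that open() picks can vary by platform and Python configuration, so it may be worth checking it on the machine in question. A minimal sketch, assuming it is acceptable to create a throwaway temporary file; the helper name default_text_encoding is made up for illustration:

```python
import os
import tempfile


def default_text_encoding() -> str:
    """Return the encoding open() uses when no encoding argument is given."""
    fd, tmp_path = tempfile.mkstemp()  # throwaway file, removed below
    os.close(fd)
    try:
        with open(tmp_path, "r") as f:
            # Text-mode file objects record the encoding they were opened with.
            return f.encoding
    finally:
        os.remove(tmp_path)


print(default_text_encoding())
```

The printed name depends on the system, which is why passing encoding explicitly is the safer habit.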

Secondary issue: so then I thought, okay, simple enough: we know which exception is being raised, so read one line, and if it raises UnicodeDecodeError, close the file and reopen it with open('file', 'r', encoding='ansi'). The problem was that utf8 could sometimes read the first few lines of an ANSI-encoded file just fine and then fail on a later line. Now the solution became clear: I had to read through the entire file with utf8, and if that failed, I knew the file was ANSI.
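The reason the first lines can pass is that pure-ASCII bytes are identical in both encodings; only bytes above 0x7F give the difference away. A small sketch with made-up sample data, assuming cp1252 stands in for "Windows ANSI" (the actual ANSI code page depends on the Windows locale):

```python
# Hypothetical sample data: two ASCII-only lines, then one line containing
# "é", all encoded as cp1252 (assumed stand-in for "Windows ANSI").
ansi_lines = [
    "id,name\n".encode("cp1252"),
    "1,Alice\n".encode("cp1252"),
    "2,Zoé\n".encode("cp1252"),
]


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Only the line with a non-ASCII character fails to decode as UTF-8.
results = [decodes_as_utf8(line) for line in ansi_lines]
print(results)  # [True, True, False]
```

Checking only the first line would therefore have classified this data as UTF-8.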

I'll post my take on this as an answer, but if someone has a better solution, I would appreciate that too :)

Recommended answer

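# First pass: try to decode the whole file as UTF-8.
# `path` is the file to inspect and must be defined beforehand.
f = open(path, 'r', encoding='utf8')
while True:
    try:
        line = f.readline()
    except UnicodeDecodeError:
        # A byte sequence is not valid UTF-8, so treat the file as ANSI.
        # Note: 'ansi' is an alias of 'mbcs' and is available on Windows only.
        f.close()
        encoding = 'ansi'
        break
    if not line:
        # Reached the end of the file without a decoding error.
        f.close()
        encoding = 'utf8'
        break

# now open your file for actual reading and data handling
with open(path, 'r', encoding=encoding) as f:
    for line in f:
        pass  # handle each line here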
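
A possible variation on the same idea, offered only as a sketch: read the raw bytes once and try each known encoding in order, strictest first. The function name detect_encoding and the default candidate list are assumptions for illustration; "cp1252" is used here as a portable stand-in for "Windows ANSI" on Western-European systems, and the whole file is loaded into memory, which may not suit very large files.

```python
import os
import tempfile


def detect_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes the whole file.

    `candidates` should be ordered strictest first. "cp1252" is an assumed
    stand-in for "Windows ANSI"; adjust it to the code page actually in use.
    """
    with open(path, "rb") as f:
        raw = f.read()  # reads the entire file into memory
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # not this encoding, try the next candidate
        return name
    raise ValueError(f"{path!r} matches none of {candidates}")


# Usage with a temporary file written as cp1252.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write("naïve café\n".encode("cp1252"))
print(detect_encoding(demo_path))  # cp1252
os.remove(demo_path)
```

Order matters here: a single-byte code page accepts almost any byte sequence, so it has to come after UTF-8, otherwise UTF-8 files would be misreported.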
