Read files with different encodings using sys.stdin in Python 3


Question

I have many files encoded in either UTF-8 or GBK. My system encoding is UTF-8 (LANG=zh_CN.UTF-8), so I can read the UTF-8 files easily, but I also need to read the GBK files. I'm following the approach from "Python 3: How to specify stdin encoding":

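import sys
import io

# Re-wrap the binary stdin stream so that it is decoded as GBK.
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    # Each line already ends with a newline, so suppress print's own.
    print(line, end='')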

My question is: how can I safely read all of the files (both GBK and UTF-8) from sys.stdin? Or is there a better solution?

To expand on this slightly, I want to process the files like this:

cat *.in | python3 handler.py

The *.in glob matches many files, each encoded in either UTF-8 or GBK.

If handler.py contains:

for line in sys.stdin:
    ...some code

it throws an error as soon as it reaches a GBK file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte

On the other hand, if I use code like this:

input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    ...some code

it throws an error on any UTF-8 file:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 25: illegal multibyte sequence

I want a safe way to handle both types of files (UTF-8 and GBK) within my script.

Recommended answer

You can read the input as raw bytes and then examine those bytes to decide which encoding to decode them with.

See also: Reading binary data from stdin

Assuming each line is encoded consistently, so that you can read and decode an entire line at a time, I'd try decoding as UTF-8 first and fall back to GBK if that fails:

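import sys

# sys.stdin.buffer is the underlying binary stream, so each line is bytes.
for raw_line in sys.stdin.buffer:
    try:
        line = raw_line.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the line is GBK.
        line = raw_line.decode('gbk')
    # ...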
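
The fallback idea above can be packaged as a small helper. What follows is only a sketch under stated assumptions: the names `CANDIDATE_ENCODINGS`, `decode_line`, and `iter_decoded_lines` are hypothetical, and the last-resort replacement of undecodable bytes is an addition that the answer does not mention. UTF-8 is tried first because it is the stricter of the two encodings; the check remains a heuristic, since a GBK byte sequence can occasionally also be valid UTF-8.

```python
import io
from typing import BinaryIO, Iterator

# Assumption: these are the only two encodings in play, as in the question.
# UTF-8 goes first because it rejects most GBK byte sequences.
CANDIDATE_ENCODINGS = ("utf-8", "gbk")


def decode_line(raw_line: bytes) -> str:
    """Decode one line, trying each candidate encoding in order."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw_line.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Neither encoding fits: keep going and mark bad bytes with U+FFFD.
    return raw_line.decode("utf-8", errors="replace")


def iter_decoded_lines(stream: BinaryIO) -> Iterator[str]:
    """Yield decoded lines from a binary stream such as sys.stdin.buffer."""
    for raw_line in stream:
        yield decode_line(raw_line)


# Simulate `cat utf8.in gbk.in` with an in-memory byte stream.
sample = io.BytesIO("中文\n".encode("utf-8") + "中文\n".encode("gbk"))
decoded = list(iter_decoded_lines(sample))
```

In handler.py, the same loop would read `for line in iter_decoded_lines(sys.stdin.buffer):` after an `import sys`.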
