Read files with different encodings using sys.stdin in Python 3
Question
I have many files encoded in either UTF-8 or GBK. My system encoding is UTF-8 (LANG=zh_CN.UTF-8), so I can read the UTF-8 files easily, but I also need to read the GBK-encoded ones. I'm following "Python 3: How to specify stdin encoding":
import sys
import io
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    print(line)
My question is: how can I safely read all the files (both GBK and UTF-8) from sys.stdin? Or can you suggest a better solution?
To expand on this slightly, I want to handle files like this:
cat *.in | python3 handler.py
Here *.in matches many files, each encoded in either UTF-8 or GBK.
If I use this in handler.py:
for line in sys.stdin:
    ...  # some code
it throws an error as soon as it tries to process a GBK file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 0: invalid continuation byte
On the other hand, if I use code like this:
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='gbk')
for line in input_stream:
    ...  # some code
it throws an error on any UTF-8 file:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 25: illegal multibyte sequence
I want to find a safe way to handle both types of files (UTF-8 and GBK) within my script.
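The two errors above can be reproduced without any files. The following is only a small illustration, assuming a hypothetical helper named try_decode and the sample string "中文"; neither appears in the original question. It shows that GBK bytes fail strict UTF-8 decoding. The reverse direction is less predictable: UTF-8 bytes decoded as GBK may either raise an error or silently produce the wrong characters, so the sketch makes no claim about it.

```python
def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Hypothetical helper, used only for this illustration.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


text = "中文"  # sample text, chosen for illustration
gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# Each byte string decodes correctly with its own encoding.
own_gbk = try_decode(gbk_bytes, "gbk")
own_utf8 = try_decode(utf8_bytes, "utf-8")

# The GBK bytes are not valid UTF-8, so strict decoding fails
# and the helper returns None instead of raising.
cross = try_decode(gbk_bytes, "utf-8")
```

This is why a single fixed encoding on the whole stream cannot work once the two kinds of files are concatenated by cat.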
Recommended answer
You can read the input as raw bytes, then examine those bytes to decide which encoding to decode them with.
Assuming you can read entire lines at a time (i.e. each line can be expected to use a single, consistent encoding), I'd try to decode as UTF-8 first, then fall back to GBK.
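import sys

# Iterate over the binary stream so each line arrives as bytes, not str.
for raw_line in sys.stdin.buffer:
    try:
        line = raw_line.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so fall back to GBK.
        line = raw_line.decode('gbk')
    # ...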
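The same idea can be packaged as a reusable generator. This is a minimal sketch, not a definitive implementation: the names decode_line and decode_lines are hypothetical, and passing errors="replace" to the GBK fallback is an added assumption so that a line valid in neither encoding does not abort the whole run. UTF-8 is tried first because its strict byte patterns reject most GBK text, whereas GBK may accept UTF-8 bytes and yield wrong characters. Splitting on newlines before decoding is safe for both encodings, since neither uses the byte 0x0A inside a multi-byte character.

```python
from typing import Iterable, Iterator


def decode_line(raw_line: bytes) -> str:
    """Decode one line as UTF-8, falling back to GBK.

    Hypothetical helper; the fallback replaces undecodable bytes
    with U+FFFD instead of raising (an assumption of this sketch).
    """
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        return raw_line.decode("gbk", errors="replace")


def decode_lines(stream: Iterable[bytes]) -> Iterator[str]:
    """Yield decoded text lines from any iterable of byte lines.

    A binary file object such as sys.stdin.buffer is such an iterable.
    """
    for raw_line in stream:
        yield decode_line(raw_line)


# In handler.py this could be driven by standard input:
#     import sys
#     for line in decode_lines(sys.stdin.buffer):
#         print(line, end="")
```

With that loop in handler.py, the command cat *.in | python3 handler.py from the question would process both kinds of files in one pass. The detection is a heuristic: a short GBK line whose bytes happen to form valid UTF-8 would be decoded as UTF-8.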