Dealing with UTF-8 numbers in Python
Question
Suppose I am reading a file containing 3 comma-separated numbers. The file was saved with an unknown encoding; so far I am dealing with ANSI and UTF-8. If the file is in UTF-8 and has one row with the values 115,113,12, then:
with open(file) as f:
    a, b, c = map(int, f.readline().split(','))
raises:
invalid literal for int() with base 10: '\xef\xbb\xbf115'
The first number is always prefixed with the characters '\xef\xbb\xbf'. The conversion works fine for the remaining two numbers. If I manually replace '\xef\xbb\xbf' with '' and then do the int conversion, it works.
Is there a better way of doing this that works for a file in any encoding?
Recommended answer
import codecs
with codecs.open(file, "r", "utf-8-sig") as f:
    a, b, c = map(int, f.readline().split(","))
This works in Python 2.6.4. The codecs.open call opens the file and returns the data as unicode, decoding it from UTF-8 and skipping the initial BOM (byte order mark), which is where the '\xef\xbb\xbf' bytes come from.
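For readers on Python 3, the same idea can probably be expressed with the built-in open, which accepts an encoding argument. The following is only a sketch under that assumption: the helper names read_three_ints and write_temp are hypothetical, and the temporary files stand in for the real input. It also illustrates that "utf-8-sig" reads a UTF-8 file whether or not it begins with a BOM.

```python
import codecs
import os
import tempfile


def read_three_ints(path):
    """Read the first line of `path` and parse three comma-separated integers."""
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present;
    # a file without a BOM is simply decoded as UTF-8.
    with open(path, "r", encoding="utf-8-sig") as f:
        a, b, c = map(int, f.readline().split(","))
    return a, b, c


def write_temp(data):
    """Write raw bytes to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# One file starting with the UTF-8 BOM (b'\xef\xbb\xbf'), one without it.
with_bom = write_temp(codecs.BOM_UTF8 + b"115,113,12\n")
without_bom = write_temp(b"115,113,12\n")

result_with_bom = read_three_ints(with_bom)
result_without_bom = read_three_ints(without_bom)
print(result_with_bom, result_without_bom)

# Clean up the demo files.
os.remove(with_bom)
os.remove(without_bom)
```

This does not detect arbitrary encodings. It only covers UTF-8 with or without a BOM, which also includes pure-ASCII content such as digits and commas.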