Dealing with UTF-8 numbers in Python

Question

Suppose I am reading a file containing 3 comma-separated numbers. The file was saved with an unknown encoding; so far I have had to deal with ANSI and UTF-8. If the file is in UTF-8 and has one row with the values 115,113,12, then:

with open(file) as f:
    a,b,c=map(int,f.readline().split(','))

raises:

invalid literal for int() with base 10: '\xef\xbb\xbf115'

The first number is always prefixed with the characters '\xef\xbb\xbf'. For the other two numbers the conversion works fine. If I manually replace '\xef\xbb\xbf' with '' and then do the int conversion, it works.

Is there a better way of doing this that works for a file in any encoding?

Recommended answer

import codecs

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if the file has one
with codecs.open(file, "r", "utf-8-sig") as f:
    a, b, c = map(int, f.readline().split(","))

This works in Python 2.6.4. The codecs.open call opens the file and returns its data as unicode, decoding from UTF-8 and skipping the initial BOM (byte order mark, the bytes '\xef\xbb\xbf') if one is present.
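
As a rough check of the behaviour described above, the sketch below writes the sample row to two temporary files, one with a BOM and one without, and reads both back with the "utf-8-sig" codec. The helper names read_three_ints and write_sample are made up for this illustration and are not part of the original answer. It uses io.open, which accepts an encoding argument on Python 2.6+ as well as Python 3 (where the built-in open takes the same argument).

```python
import codecs
import io
import os
import tempfile


def read_three_ints(path):
    # Hypothetical helper: "utf-8-sig" strips a leading BOM if present
    # and otherwise decodes the file as ordinary UTF-8.
    with io.open(path, "r", encoding="utf-8-sig") as f:
        return [int(field) for field in f.readline().split(",")]


def write_sample(data):
    # Hypothetical helper: write raw bytes to a temp file, return its path.
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# codecs.BOM_UTF8 is the three bytes '\xef\xbb\xbf' seen in the error message.
with_bom = write_sample(codecs.BOM_UTF8 + b"115,113,12\n")
without_bom = write_sample(b"115,113,12\n")

result_with_bom = read_three_ints(with_bom)
result_without_bom = read_three_ints(without_bom)
print(result_with_bom, result_without_bom)

os.remove(with_bom)
os.remove(without_bom)
```

Both reads should give the same three integers, so the same call can be used whether or not the file starts with a BOM. Digits and commas are plain ASCII, so a file containing only those characters and saved in an ANSI code page should decode the same way; that is an assumption about the file contents, not something the codec guarantees for arbitrary ANSI text.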
