Python 3 unicode to utf-8 on file


Question

I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:

  • I pull file up in notepad
  • Save as...
  • change encoding from unicode to UTF-8
  • Then run python program on it

So this is the process I would like to automate in Python 3.4: essentially just convert the file to UTF-8, or use something like open(filename,'r',encoding='utf-8'). However, that exact line threw this error when I called read() on the file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing as UTF-8, so that I don't have to call str.encode (or something like that) every time I analyze a string.

Anybody been through this and know which method I should use and how to do it?

In the Python 3 REPL, I ran:

>>> f = open('file.txt','r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>

So now the Python code in my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regexes over this file, though, and they aren't matching anything (I think because it isn't utf-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom

Recommended answer

What Notepad calls "Unicode" is utf16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following error when using utf8 to decode the file:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
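
To confirm which encoding a file actually uses before decoding it, one option is to look at its first few bytes for a BOM. The sketch below is only an illustration: sniff_bom and guess_file_encoding are hypothetical helper names, and falling back to plain UTF-8 when no BOM is present is an assumption that may not hold for your files.

```python
import codecs


def sniff_bom(raw: bytes) -> str:
    """Guess a codec name from a leading byte order mark (hypothetical helper)."""
    # Check UTF-32 first: its little-endian BOM (FF FE 00 00) also starts
    # with the UTF-16 little-endian BOM (FF FE).
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Assumption: a file without a BOM is treated as plain UTF-8.
    return "utf-8"


def guess_file_encoding(path: str) -> str:
    """Read the first four bytes of a file and sniff them for a BOM."""
    with open(path, "rb") as f:
        return sniff_bom(f.read(4))
```

A file saved from Notepad as "Unicode" should come back as 'utf-16', and that name can be passed straight to open(..., encoding=...).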

To convert to UTF-8, you could use:

with open('log.txt',encoding='utf16') as f:
    data = f.read()
with open('utf8.txt','w',encoding='utf8') as f:
    f.write(data)
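
If the goal is only to search the log, the conversion step may not be needed at all: once the file is opened with the right encoding, Python hands back ordinary str objects and regular expressions work on them directly. A minimal sketch, assuming the log is UTF-16 with a BOM; find_matches is a hypothetical helper name, not something from the question:

```python
import re


def find_matches(path, pattern, encoding="utf-16"):
    """Yield each line of the file at `path` that matches `pattern`."""
    regex = re.compile(pattern)
    # The 'utf-16' codec reads the BOM to pick the byte order and strips it,
    # so the lines arrive as normal str objects.
    with open(path, encoding=encoding) as f:
        for line in f:
            if regex.search(line):
                yield line.rstrip("\n")
```

For example, list(find_matches('log.txt', r'ERROR')) would collect every line containing ERROR, reading the file one line at a time rather than loading it whole.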

Note that many Windows editors expect a UTF-8 signature at the beginning of the file and may otherwise assume ANSI. "ANSI" really means the code page of the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
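
The two output variants can be folded into one small function. This is a sketch rather than a definitive implementation: convert_to_utf8 and its with_signature flag are hypothetical names, the input is assumed to be UTF-16 with a BOM, and newline='' is used so line endings pass through unchanged.

```python
def convert_to_utf8(src, dst, with_signature=False):
    """Copy a UTF-16 text file to UTF-8, optionally writing a UTF-8 signature."""
    # 'utf-8-sig' writes the bytes EF BB BF at the start of the output file.
    out_encoding = "utf-8-sig" if with_signature else "utf-8"
    # newline='' disables newline translation on both sides, so the original
    # line endings (for example \r\n) are preserved.
    with open(src, encoding="utf-16", newline="") as fin, \
            open(dst, "w", encoding=out_encoding, newline="") as fout:
        for line in fin:
            fout.write(line)
```

Calling convert_to_utf8('log.txt', 'utf8.txt', with_signature=True) produces a file that editors relying on the signature should recognize as UTF-8.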
