Read a file and try to remove all non-UTF-8 characters

Problem description

I am trying to read a file and convert its contents to a UTF-8 string, in order to remove some non-UTF-8 characters from it:

file_str = open(file_path, 'r').read()
file_str = file_str.decode('utf-8')

but I got the following error:

AttributeError: 'str' object has no attribute 'decode'

Update: I tried the code suggested by the answer:

file_str = open(file_path, 'r', encoding='utf-8').read()

but it didn't eliminate the non-UTF-8 characters. How can I remove them?

Solution

Remove the .decode('utf8') call. Your file data has already been decoded, because in Python 3 an open() call in text mode (the default) returns a file object that decodes the data to Unicode strings for you.
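
To make the difference concrete, here is a minimal sketch, assuming Python 3. The temporary file and its sample text are made up for illustration and are not part of the original question:

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("héllo".encode("utf-8"))
    file_path = tmp.name

# Text mode (the default): read() returns str, which is already decoded.
with open(file_path, "r", encoding="utf-8") as f:
    text = f.read()

# Binary mode: read() returns bytes, and bytes do have a decode() method.
with open(file_path, "rb") as f:
    raw = f.read()

os.remove(file_path)

print(type(text).__name__, hasattr(text, "decode"))     # str False
print(type(raw).__name__, raw.decode("utf-8") == text)  # bytes True
```

So a .decode() call only makes sense if the file was opened in binary mode ('rb').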

You probably do want to add the encoding to the open() call to make this explicit. Otherwise Python uses a system default, and that may not be UTF-8:

file_str = open(file_path, 'r', encoding='utf8').read()

For example, on Windows the default codec is almost certainly going to be wrong for UTF-8 data, but you won't see the problem until you try to read the text. You'd find you have Mojibake, because the UTF-8 data was decoded using CP1252 or a similar 8-bit codec.
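
A small illustration of how that Mojibake arises, using a made-up sample string. Decoding with the wrong codec raises no error here; it quietly produces garbled text:

```python
original = "café"
utf8_bytes = original.encode("utf-8")  # b'caf\xc3\xa9'

# Decoding UTF-8 bytes with an 8-bit codec succeeds, but each byte of the
# two-byte sequence for "é" is turned into a separate character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # cafÃ©

# Decoding with the codec the data was written in recovers the text.
print(utf8_bytes.decode("utf-8"))  # café
```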

See the open() function documentation for further details.
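
The answer above does not address the follow-up about actually removing the offending data. One possible approach, offered here as a sketch rather than as part of the original answer, is the errors parameter of open(). It assumes the "non-UTF-8 characters" are bytes that cannot be decoded as UTF-8; characters that are valid UTF-8 but merely unwanted would need separate filtering. The sample file below is hypothetical:

```python
import os
import tempfile

# Hypothetical sample data: ASCII text with one stray byte (0xFF),
# which can never appear in well-formed UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"abc\xffdef")
    file_path = tmp.name

# errors="ignore" silently drops any bytes that cannot be decoded.
with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
    cleaned = f.read()

# errors="replace" substitutes U+FFFD instead, so the damage stays visible.
with open(file_path, "r", encoding="utf-8", errors="replace") as f:
    marked = f.read()

os.remove(file_path)

print(cleaned)                   # abcdef
print(marked == "abc\ufffddef")  # True
```

Note that errors="ignore" discards data without any warning, so errors="replace" may be the safer choice while you are still investigating where the bad bytes come from.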
