Split function adds \xef\xbb\xbf... to my list
Question
I want to open my file.txt and split all the data in this file.
Here is my file.txt:
some_data1 some_data2 some_data3 some_data4 some_data5
And here is my Python code:
>>> file_txt = open("file.txt", 'r')
>>> data = file_txt.read()
>>> data_list = data.split(' ')
>>> print data
some_data1 some_data2 some_data3 some_data4 some_data5
>>> print data_list
['\xef\xbb\xbfsome_data1', 'some_data2', 'some_data3', 'some_data4', 'some_data5\n']
As you can see, when I print data_list, the list contains \xef\xbb\xbf at the start of the first item and \n at the end of the last one. What are these, and how can I remove them from my list?
Thanks.
Recommended answer
Your file contains a UTF-8 BOM (byte order mark, the bytes \xef\xbb\xbf) at the beginning.
To get rid of it, first decode your file contents to unicode using the "utf-8-sig" codec, which drops a leading BOM.
# Python 2: read the raw bytes, strip the BOM while decoding, then re-encode
fp = open("file.txt")
data = fp.read().decode("utf-8-sig").encode("utf-8")
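To make the effect of the codec visible, here is a minimal sketch on in-memory bytes. The sample bytes are an assumption standing in for the contents of file.txt; only the standard codecs module is used.

```python
import codecs

# Assumed sample: bytes as they would be read from a file saved with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b"some_data1 some_data2\n"

# The UTF-8 BOM is exactly these three bytes.
assert raw.startswith(b"\xef\xbb\xbf")

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start of the text.
plain = raw.decode("utf-8")

# "utf-8-sig" drops a leading BOM if one is present.
clean = raw.decode("utf-8-sig")

print(repr(plain[0]))
print(clean.split())
```

The first print shows the leftover U+FEFF character, and the second shows the split list without any BOM debris.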
But it is better not to encode it back to utf-8, and instead to work with the unicode text. There is a good rule: decode all your input text data to unicode as early as possible, work only with unicode, and encode the output data to the required encoding as late as possible. This will save you from many headaches.
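The rule can be sketched roughly as follows. The helper names decode_input and encode_output are hypothetical, made up for illustration, and the sample bytes are an assumption.

```python
def decode_input(raw_bytes, encoding="utf-8-sig"):
    """Hypothetical helper: turn incoming bytes into text at the boundary."""
    return raw_bytes.decode(encoding)


def encode_output(text, encoding="utf-8"):
    """Hypothetical helper: turn text back into bytes only when writing out."""
    return text.encode(encoding)


# Assumed input: bytes with a leading UTF-8 BOM and a trailing newline.
raw = b"\xef\xbb\xbfsome_data1 some_data2\n"

text = decode_input(raw)      # decode as early as possible
tokens = text.split()         # all processing happens on text, not bytes
joined = u",".join(tokens)
out = encode_output(joined)   # encode as late as possible

print(tokens)
print(out)
```

Keeping the byte/text conversions at the edges means the processing in the middle never has to care about BOMs or encodings.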
To read bigger files in a certain encoding, use io.open or codecs.open. Both take an encoding argument and decode the data as it is read, so you do not have to load the whole file first.
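As a rough sketch of the io.open approach: the example below writes a small temporary file with a BOM (an assumption, standing in for file.txt) and then reads it line by line with encoding="utf-8-sig".

```python
import io
import os
import tempfile

# Create a throwaway file that starts with a UTF-8 BOM (stand-in for file.txt).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfsome_data1 some_data2 some_data3\n")

data_list = []
# io.open decodes while reading; "utf-8-sig" skips the leading BOM.
with io.open(path, encoding="utf-8-sig") as fp:
    for line in fp:                      # iterate lazily, one line at a time
        data_list.extend(line.split())   # split() also discards the newline

os.remove(path)
print(data_list)
```

Iterating over the file object keeps memory use low for large files, since only one line is held at a time.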
Also check this.
Use str.strip() or str.rstrip() to get rid of the newline character \n.
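A small sketch of the difference, using an assumed sample string in place of the decoded file contents:

```python
# Assumed sample: decoded file contents ending with a newline.
data = u"some_data1 some_data2 some_data5\n"

# Splitting on a single space leaves the newline attached to the last item.
with_newline = data.split(" ")

# Stripping the trailing newline first gives a clean list.
stripped = data.rstrip("\n").split(" ")

# split() with no argument splits on any whitespace and drops empty strings,
# so it handles the trailing newline as well.
whitespace_split = data.split()

print(with_newline)
print(stripped)
print(whitespace_split)
```

Calling split() without an argument is often the simplest choice when the items are separated by arbitrary whitespace.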