How to remove non-UTF-8 characters and save as a CSV file in Python
Question
I have some Amazon review data that I have successfully converted from text format to CSV. The problem is that when I try to read it into a dataframe using pandas, I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove them and save the result to another CSV file?
Thanks!
Here is the code I use to convert the text to CSV:
import csv
import string
INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
"product/productId",
"review/userId",
"review/profileName",
"review/helpfulness",
"review/score",
"review/time",
"review/summary",
"review/text"]
f = open(INPUT_FILE_NAME,encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME,"w")
outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:
    line = line.strip()
    # need to remove the commas so that the review text won't be split across many columns
    line = line.replace(',','')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":",1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()
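As an aside, the unused `import csv` in the script hints at an alternative to stripping commas by hand. The sketch below assumes, as the loop above implies, that each record is a blank-line-separated block of `key: value` lines; it lets `csv.writer` quote any field that contains a comma. The function name `records_to_csv` and the sample lines are hypothetical, not part of the original post.

```python
import csv
import io


def records_to_csv(lines, out, header):
    """Write blank-line-separated 'key: value' records as CSV rows."""
    writer = csv.writer(out)
    writer.writerow(header)
    row = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line ends the current record.
            if row:
                writer.writerow(row)
            row = []
            continue
        parts = line.split(":", 1)
        # Keep only the value; skip lines that lack a "key: value" shape.
        if len(parts) == 2:
            row.append(parts[1].strip())
    if row:
        writer.writerow(row)


# Hypothetical sample input with a comma inside one value.
sample = [
    "review/summary: Great, really",
    "review/text: Loved it",
    "",
    "review/summary: Meh",
    "review/text: ok",
]
buf = io.StringIO()
records_to_csv(sample, buf, ["review/summary", "review/text"])
result = buf.getvalue()
print(result)
```

When writing to a real file instead of `io.StringIO`, the `csv` module documentation recommends opening it with `newline=""`, and an explicit `encoding="utf-8"` avoids depending on the platform default.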
Thanks to everyone who tried to help me out. I solved it by specifying the output encoding in my code:
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
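One plausible reading of why this fix works, assuming the script ran on a platform whose default text encoding is a single-byte code page such as cp1252 (the post does not state the platform): without an explicit `encoding`, `open(..., "w")` writes with the locale default, so a character like `ø` ends up in the CSV as the single byte 0xf8, which is not valid UTF-8. The short sketch below reproduces that byte and the kind of error quoted in the question; the sample string is hypothetical.

```python
# Assumption: the platform default encoding was cp1252 (common on Windows).
text = "Søren"  # hypothetical reviewer name with a non-ASCII character

legacy_bytes = text.encode("cp1252")  # bytes a cp1252-encoded file would hold
utf8_bytes = text.encode("utf-8")     # bytes a UTF-8 reader expects

try:
    legacy_bytes.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # Same failure pandas reports when it reads the file as UTF-8.
    error_message = str(exc)

print(legacy_bytes)
print(utf8_bytes)
print(error_message)
```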
Recommended answer
If the input file is not UTF-8 encoded, it is probably not a good idea to try to read it as UTF-8...
You have basically two ways to deal with decode errors:
- use a charset that will accept any byte, such as iso-8859-15, also known as latin9
- if the data should be utf-8 but contains errors, use errors='ignore' -> silently removes the invalid bytes, or errors='replace' -> replaces the invalid bytes with a replacement marker (U+FFFD when decoding, usually displayed as ?)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
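To make the difference between these options concrete, here is a minimal sketch that decodes the same bytes three ways and then rewrites a file as clean UTF-8, which is what the question asks for. The helper name `rewrite_as_utf8`, the file names, and the sample bytes are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile

raw = b"caf\xf8 review"  # sample bytes; 0xf8 is not valid UTF-8

as_latin9 = raw.decode("latin9")                  # every byte maps to a character
ignored = raw.decode("utf-8", errors="ignore")    # invalid byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD


def rewrite_as_utf8(src_path, dst_path, errors="replace"):
    """Copy src to dst, decoding leniently and writing valid UTF-8."""
    with open(src_path, encoding="utf-8", errors=errors, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demonstrate on a throwaway file containing one invalid byte.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "dirty.csv")
    dst_path = os.path.join(tmp, "clean.csv")
    with open(src_path, "wb") as fh:
        fh.write(b"summary,text\n" + raw + b",ok\n")
    rewrite_as_utf8(src_path, dst_path, errors="ignore")
    with open(dst_path, encoding="utf-8") as fh:
        cleaned = fh.read()

print(repr(as_latin9), repr(ignored), repr(replaced))
print(repr(cleaned))
```

The rewritten file decodes as UTF-8, so pandas can read it with its default settings. If the source really is latin9, passing `encoding="latin9"` to `pandas.read_csv` is likely the simpler route, since it keeps the characters instead of discarding them.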