How to remove non-UTF-8 characters and save as a CSV file in Python

Question

I have some Amazon review data that I have successfully converted from text format to CSV. The problem is that when I try to read it into a DataFrame using pandas, I get this error message: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte

I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove them and save the result to another CSV file?

Thanks!

Here is the code I used to convert the text to CSV:

import csv
import string
INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]
f = open(INPUT_FILE_NAME,encoding="utf-8")

outfile = open(OUTPUT_FILE_NAME,"w")

outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:

   line = line.strip()  
   # need to remove commas so that the review text is not split across several columns
   line = line.replace(',','')

   if line == "":
      outfile.write(",".join(currentLine))
      outfile.write("\n")
      currentLine = []
      continue
   parts = line.split(":",1)
   currentLine.append(parts[1])

if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()

Thanks to everyone who tried to help me out. I solved it by changing how the output file is opened in my code:

 outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
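
A likely reason this change helps: when `open` is called in text mode without an `encoding` argument, Python falls back to the platform's locale encoding (often cp1252 on Windows), so the CSV may have been written in a non-UTF-8 encoding that pandas then failed to decode. The sketch below illustrates writing with an explicit encoding; it also uses `csv.writer`, which quotes fields containing commas instead of stripping them. The helper name `write_rows_utf8` and the sample data are hypothetical and not part of the original code.

```python
import csv
import os
import tempfile

def write_rows_utf8(path, header, rows):
    """Write a header and rows to a CSV file, always encoded as UTF-8."""
    # newline="" is what the csv module recommends for files it writes to
    with open(path, "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

# Hypothetical sample data: one non-ASCII character and one embedded comma
header = ["review/profileName", "review/summary"]
rows = [["Søren", "Great, would watch again"]]

path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_rows_utf8(path, header, rows)

# Reading back with the same explicit encoding recovers the original values
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))
```

Because the encoding is stated on both the write and the read side, the result no longer depends on the machine's locale settings.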

Recommended answer

If the input file is not UTF-8 encoded, it is probably not a good idea to try to read it as UTF-8...

You basically have two ways to deal with decode errors:

  • use a charset that accepts any byte, such as iso-8859-15 (also known as latin9); decoding then never fails, although the characters may be wrong if the file really uses a different encoding
  • if the file should be UTF-8 but contains errors, use errors='ignore' to silently drop the bytes that are not valid UTF-8, or errors='replace' to substitute a replacement marker for them (U+FFFD, shown as �, when decoding)

For example:

f = open(INPUT_FILE_NAME,encoding="latin9")

f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
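
To make the difference between these options concrete, here is a minimal sketch that decodes the same bytes three ways and then re-saves a file as clean UTF-8. The sample bytes and the helper name `resave_as_utf8` are assumptions for illustration; only `open`, `bytes.decode`, and their `encoding`/`errors` parameters come from the standard library.

```python
# Hypothetical sample: "Søren" encoded as Latin-1 contains the byte 0xf8,
# which is not a valid start byte in UTF-8
raw = "Søren".encode("latin-1")

as_latin9 = raw.decode("latin9")                  # every byte maps to a character
ignored = raw.decode("utf-8", errors="ignore")    # invalid byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD

def resave_as_utf8(src, dst, errors="replace"):
    """Copy src to dst as valid UTF-8, handling bad bytes per `errors`."""
    # newline="" on both sides keeps the original line endings untouched
    with open(src, encoding="utf-8", errors=errors, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

A call such as `resave_as_utf8("small-movies1.csv", "small-movies1-clean.csv", errors="ignore")` (file names assumed) would produce a file that decodes as UTF-8, at the cost of losing or marking the characters that could not be decoded. If the original encoding is known, decoding with that encoding preserves the text instead.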
