Problems reading CSV file with commas and characters in pandas

Question

I am trying to read a CSV file with pandas. The file has a column called Tags, which holds user-provided tags such as -, "", '', 1950's and 16th-century. Because the tags are user-provided, many special characters have also been entered by mistake. The problem is that I cannot open the file with pandas read_csv; it fails with the error: CParser, error tokenizing data. Can someone help me read this CSV file into pandas?

Recommended answer

Okay. Let's start from a badly formatted CSV that pandas can't read:

>>> !cat unquoted.csv
1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
>>> pd.read_csv("unquoted.csv", header=None)
Traceback (most recent call last):
  File "<ipython-input-40-7d9aadb2fad5>", line 1, in <module>
    pd.read_csv("unquoted.csv", header=None)
[...]
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17041)
CParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
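
Before repairing anything, it can help to see which lines have an unexpected number of fields. The following is a minimal sketch using only the standard csv module; the sample text mirrors unquoted.csv above, and the helper name field_counts is made up for illustration. It assumes that the most common field count is the correct one.

```python
import csv
import io
from collections import Counter

# Sample text mirroring unquoted.csv from the session above.
sample = (
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)

def field_counts(lines):
    """Return (row_number, field_count) pairs, numbering rows from 1."""
    return [(i, len(row)) for i, row in enumerate(csv.reader(lines), start=1)]

counts = field_counts(io.StringIO(sample))

# Assumption: the most common field count is the intended one.
expected = Counter(n for _, n in counts).most_common(1)[0][0]
bad_lines = [(i, n) for i, n in counts if n != expected]
print(bad_lines)  # [(2, 6)]
```

For the sample this reports row 2 with 6 fields, which agrees with the parser error above. To run it on a real file, pass a file opened with open("unquoted.csv", newline="") instead of the StringIO object.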

We can produce a cleaner version by taking advantage of the fact that the last three columns are well-behaved:

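import csv

# In Python 3, open in text mode; newline="" lets the csv module handle line endings.
with open("unquoted.csv", newline="") as infile, \
        open("quoted.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        # Rejoin everything before the last three fields into a single tags field;
        # csv.writer quotes that field whenever it contains a comma.
        new_row = [','.join(row[:-3])] + row[-3:]
        writer.writerow(new_row)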

which produces

>>> !cat quoted.csv
1950's,xyz.nl/user_003,bad, 123
"17th,red,flower",xyz.nl/user_001,good,203
,xyz.nl/user_239,not very,345

which we can then read:

>>> pd.read_csv("quoted.csv", header=None)
                 0                1         2    3
0           1950's  xyz.nl/user_003       bad  123
1  17th,red,flower  xyz.nl/user_001      good  203
2              NaN  xyz.nl/user_239  not very  345
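
The same repair can be done in memory, with a guard for rows that are too short to fix. This is only a sketch: repair_csv_text and its n_tail parameter are hypothetical names, and it assumes every row ends with exactly n_tail well-behaved fields, with everything before them belonging to the tags column.

```python
import csv
import io

def repair_csv_text(text, n_tail=3):
    """Merge all leading fields into one, keeping the last n_tail fields as-is.

    Hypothetical helper: assumes the last n_tail fields of every row are
    well-behaved and everything before them is the free-text tags column.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) < n_tail + 1:
            # Too few fields: we cannot tell which column is missing.
            raise ValueError(f"row {line_no}: only {len(row)} fields, cannot repair")
        writer.writerow([",".join(row[:-n_tail])] + row[-n_tail:])
    return out.getvalue()

sample = (
    "1950's,xyz.nl/user_003,bad, 123\n"
    "17th,red,flower,xyz.nl/user_001,good,203\n"
    '"",xyz.nl/user_239,not very,345\n'
)
fixed = repair_csv_text(sample)
print(fixed)
```

Since read_csv also accepts file-like objects, the result can be loaded with pd.read_csv(io.StringIO(fixed), header=None) without writing an intermediate file.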

I'd look into fixing this problem at the source and getting the data in a tolerable format, though. Tricks like this shouldn't be necessary, and the file could easily have been impossible to repair: if a second column had also contained unquoted commas, for example, there would be no way to tell where the tags end.
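
As an illustration of what fixing it at the source could look like, here is a minimal sketch that assumes the export is produced by a Python script (the question does not say how the file is generated). The records are made up. Writing through csv.writer quotes any field that contains a comma or a double quote, so the file reads back with the same number of fields per row.

```python
import csv
import io

# Hypothetical records: (tags, user, rating, count); the tags are free text.
records = [
    ("1950's", "xyz.nl/user_003", "bad", 123),
    ("17th,red,flower", "xyz.nl/user_001", "good", 203),
    ('say "cheese", please', "xyz.nl/user_239", "not very", 345),
]

buffer = io.StringIO()
# QUOTE_MINIMAL (the default) quotes only the fields that need it.
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(records)
exported = buffer.getvalue()
print(exported)

# Reading the text back yields four fields per row, with the tags intact.
round_trip = list(csv.reader(io.StringIO(exported)))
```

A file written this way should need no repair step before read_csv, because commas inside the tags are protected by the quotes.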
