Parsing CSV file in pandas with commas in last column


Question

I'm stuck with some poorly formatted CSV data that I need to read into a Pandas dataframe. I cannot change how the data is being recorded (it's coming from someplace else), so please no solutions suggesting that.

Most of the data is fine, but some rows have commas in the last column. A simplified example:

column1 is fine,column 2 is fine,column3, however, has commas in it!

All rows should have the same number of columns (3), but this example of course breaks the CSV reader, because the commas suggest there are 5 columns when in fact there are 3.
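To make the mismatch concrete, here is a small sketch that feeds the simplified example row to Python's standard `csv.reader`. The variable names `line` and `fields` are only illustrative.

```python
import csv

# The simplified example row from above: three logical columns, four commas.
line = "column1 is fine,column 2 is fine,column3, however, has commas in it!"

# csv.reader treats every unquoted comma as a delimiter.
fields = next(csv.reader([line]))

print(len(fields))  # 5
print(fields)
```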

Notice that there is no quoting that would allow me to use the standard CSV reader tools to handle this problem.

What I do know, however, is that the extra comma(s) always occur in the last (rightmost) column. This means I can use a solution that boils down to:

"Always assume there are 3 columns, counting from the left, and interpret all extra commas as string content within column 3". Or, worded differently, "Interpret the first two commas as column separators, but assume any subsequent commas are just part of the string in column 3."

I can think of plenty of kludgy ways to accomplish this, but my question is: Is there any elegant, concise way of addressing this, preferably within my call to pandas.read_csv(...)?

Recommended answer

Fix the CSV, then proceed normally:

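import csv

# Python 3: open both files in text mode; newline='' leaves line endings to the csv module.
with open('path/to/broken.csv', newline='') as f, open('path/to/fixed.csv', 'w', newline='') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        # Split on the first two commas only; everything after them stays in column 3.
        row = line.rstrip('\r\n').split(',', 2)
        # csv.writer quotes column 3 whenever it contains commas.
        writer.writerow(row)


import pandas as pd
df = pd.read_csv('path/to/fixed.csv')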
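If writing an intermediate file is not wanted, the same split can be done in memory. The following is only a sketch under assumptions: the helper name `split_rows`, its `n_cols` parameter, and the sample header `a,b,c` are hypothetical and not part of the original answer, and it assumes the first line holds the column names.

```python
def split_rows(lines, n_cols=3):
    """Yield each non-empty line split into at most n_cols fields.

    Only the first n_cols - 1 commas act as separators; any later
    commas remain part of the last field.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            yield line.split(",", n_cols - 1)


# Hypothetical sample data standing in for an open file object.
sample = [
    "a,b,c\n",
    "1,2,plain text\n",
    "column1 is fine,column 2 is fine,column3, however, has commas in it!\n",
]

rows = list(split_rows(sample))
header, data = rows[0], rows[1:]
```

With pandas installed, `pd.DataFrame(data, columns=header)` would turn these rows into a dataframe; for a real file, an open file object can be passed in place of `sample`. Note that every value arrives as a string, so numeric columns would need converting afterwards.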
