pandas read_csv with final column containing commas
Question
I have a CSV dataset that, by my book, is well formed, and I'm trying to get the pandas package to load it correctly. The header consists of 5 column names, but the final column consists of JSON objects that contain unescaped commas, e.g.
A,B,C,D,E
1,2,3,4,{K1:V1,K2:V2}
I'm loading my data with a simple training = pd.read_csv('data/training.dat')
However, pandas is clearly misinterpreting the additional commas as new unlabeled columns, and I'm getting an error like this:
CParserError: Error tokenizing data. C error: Expected 75 fields in line 3, saw 84
I tried browsing the documentation but evidently came up short. Does anyone know how to configure the pd.read_csv command to parse this correctly?
I guess the alternative is I could hack together a script that flattens the JSON objects using a union of their keys as columns.
Recommended answer
If it is feasible for you to replace { with "{ and } with }", the file can be read correctly with: pd.read_csv('data/training.dat', quotechar='"', skipinitialspace=True)
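The quoting approach can be sketched as below. This is only an illustration under some assumptions: the sample text stands in for data/training.dat, the variable names are made up for the example, and the objects are assumed to contain neither nested braces nor double quotes of their own (real JSON with quoted keys would need extra escaping).

```python
import io

import pandas as pd

# Hypothetical sample standing in for the contents of data/training.dat
raw = "A,B,C,D,E\n1,2,3,4,{K1:V1,K2:V2}\n5,6,7,8,{K1:V3}\n"

# Wrap every brace-delimited object in double quotes so the CSV parser
# treats the commas inside it as part of a single field.
# Assumption: no nested braces and no double quotes inside the objects.
quoted = raw.replace("{", '"{').replace("}", '}"')

df = pd.read_csv(io.StringIO(quoted), quotechar='"', skipinitialspace=True)
print(df)
```

For a file on disk, the same replacement could be applied to the text read from the file before handing it to read_csv through io.StringIO.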
Solution:
In [205]:
print(pd.read_csv('a.data', sep=r",(?![^{]*\})", header=None, engine='python'))
   0  1  2  3              4
0  A  B  C  D              E
1  1  2  3  4  {K1:V1,K2:V2}

[2 rows x 5 columns]
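A self-contained variant of the regex-separator idea, kept as a sketch: the separator matches a comma only when it is not followed by a closing brace before the next opening brace, so commas inside {...} are left alone. The sample text and variable names are assumptions for illustration, and the pattern assumes the objects are not nested. Regular-expression separators are handled by the Python parsing engine, so engine="python" is passed explicitly.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the file from the question
raw = "A,B,C,D,E\n1,2,3,4,{K1:V1,K2:V2}\n"

# Split on a comma unless a '}' follows before any '{',
# i.e. unless the comma sits inside a brace-delimited object.
# Assumption: objects are not nested.
pattern = r",(?![^{]*\})"

# Unlike the session above, the first row is used as the header here.
df = pd.read_csv(io.StringIO(raw), sep=pattern, engine="python")
print(df)
```

Compared with the quoting approach, this leaves the input untouched, at the cost of using the slower Python engine.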