Python: Issue with delimiter inside column data


Question

This is not a duplicate of another question, as I do not want to drop the rows. The accepted answer in the aforementioned post is very different from what I need, and it is not aimed at preserving all the data.

Problem: Delimiter inside column data from a badly formatted CSV file

Tried solutions: csv module, shlex, StringIO (no working solution on SO)

Sample data

The delimiter (;) also appears inside the third data field, at positions that are loosely enclosed by (multiple) double quotes:

08884624;6/4/2016;Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\";9999;resell:no;package:1;test
0085658;6/4/2016;Logic 111BLACK.compat: 29,46 cm (11.6\"\")deep: 4;06 cm height: 25;9 cm\"\";9999;resell:no;package:1;test
4235846;6/4/2016;Case Logic. compat: 39,624 cm (15.6\"\") deep: 3;05 cm height: 3 cm\"\";9999;resell:no;package:1;test
400015;6/4/2016;Cable\"\"Easy Cover\"\"\"\";1;5 m 30 Silver\"\";9999;resell:no;package:1;test
9791118;6/4/2016;Network routing 21,5\"\" (2013) 2;7GHz\"\";9999;resell:no;package:1;test
477000;6/4/2016;iGlaze. deep: 9,6 mm (67.378\"\") height: 14;13 cm\"\";9999;resell:no;package:1;test
4024001;6/4/2016;DigitalBOX. tuner: Digital, Power: 20 W., Diag: 7,32 cm (2.88\"\"). Speed 10;100 Mbit/s\"\";9999;resell:no;package:1;test

Desired sample output

Fixed length of 7 fields:

['08884624','6/4/2016', 'Network routing 21,5\" 4,8GHz1TB hddQwerty', '9999', 'resell:no', 'package:1', 'test']

Parsing with the csv reader doesn't fix the problem (skipinitialspace is not the issue), shlex is of no use, and StringIO doesn't help either.
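To illustrate why the csv reader does not help here, the following minimal sketch feeds one of the sample lines to `csv.reader` with `delimiter=';'`. The double quotes sit in the middle of a field rather than at its start, so the reader treats them as ordinary characters and still splits on every `;`. The variable names are illustrative, and the sample line is copied as it appears above (backslashes included).

```python
import csv
import io

# One sample line, copied verbatim from the data above (raw string keeps the backslashes).
line = r'9791118;6/4/2016;Network routing 21,5\"\" (2013) 2;7GHz\"\";9999;resell:no;package:1;test'

# Quotes that appear mid-field do not start a quoted section,
# so the ';' inside the third field is still treated as a delimiter.
row = next(csv.reader(io.StringIO(line), delimiter=';'))

print(len(row))  # 8 fields instead of the expected 7
print(row[2:4])  # the third field has been cut in two
```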

My initial idea was to import row by row and replace the ';' element by element within each row. But the import itself is the problem, as it splits on every ';'.

The data comes from a larger file with 300,000+ rows (not all rows have this problem). Any advice is welcome.

Recommended answer

Since you know the number of input fields, and only one field is badly formatted, you can simply split on ; and then join the middle fields back into a single one:

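for line in file:
    # Split on every ';', including the ones inside the malformed third field.
    temp_l = line.rstrip('\n').split(';')
    # Keep the first 2 and the last 4 fields; re-join everything in between.
    lst = temp_l[:2] + [';'.join(temp_l[2:-4])] + temp_l[-4:]  # lst should contain the 7 expected fields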
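The same idea can be wrapped in a small helper that also checks the field count, which may be useful on a file of 300,000+ rows where only some lines are malformed. This is a sketch under the answer's assumptions (two clean leading fields, four clean trailing fields, one messy field in between); the function name `split_fixed` and its parameters are hypothetical, not part of the original answer.

```python
def split_fixed(line, n_head=2, n_tail=4, sep=';'):
    """Split line into n_head + 1 + n_tail fields, re-joining the middle part.

    Assumes only the field between the head and tail fields can contain sep.
    """
    parts = line.rstrip('\r\n').split(sep)
    if len(parts) < n_head + n_tail + 1:
        raise ValueError(f"too few fields in line: {line!r}")
    tail_start = len(parts) - n_tail
    # Everything between the head and the tail belongs to the single messy field.
    middle = sep.join(parts[n_head:tail_start])
    return parts[:n_head] + [middle] + parts[tail_start:]


# Sample line copied verbatim from the question (raw string keeps the backslashes).
sample = r'9791118;6/4/2016;Network routing 21,5\"\" (2013) 2;7GHz\"\";9999;resell:no;package:1;test'
fields = split_fixed(sample + '\n')
print(len(fields))  # 7
print(fields[2])
```

Lines that have no stray delimiter pass through unchanged, because the middle slice then contains exactly one element. Cleaning up the double quotes inside the third field is deliberately left out here, as the answer does not address it.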

I did not even try to process the double quotes, because I could not understand how you get from Network routing 21,5\"\" 4;8GHz1TB hddQwerty\"\" to 'Network routing 21,5\" 4,8GHz1TB hddQwerty'...
