How to handle double quotes inside field values with csv module?
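Question
I'm trying to parse CSV files from an external system that I have no control over.
- A comma is used as the separator.
- When a cell contains a comma, it is wrapped in quotes, and all other quote characters are escaped with another quote character.
Example CSV: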
qwerty,abcd,efg
qw""erty,"a""b""c""d,ef""""g"
Should be parsed as:
[['qw"erty', 'a"b"c"d,ef""g']]
However, I think that Python's csv module does not expect quote characters to be escaped when the cell was not wrapped in quote characters in the first place. csv.reader(my_file) (with the default doublequote=True) returns:
['qw""erty', 'a"b"c"d,ef""g']
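The behaviour described above can be reproduced in a few lines. This is a minimal sketch, assuming Python 3 and the problematic example line held in an in-memory string instead of a file; the variable names are illustrative only.

```python
import csv
from io import StringIO

# The problematic example line: quotes are doubled even in the unquoted first cell.
sample = 'qw""erty,"a""b""c""d,ef""""g"\n'

# Default dialect (doublequote=True): "" is only collapsed inside quoted cells,
# so the unquoted first cell keeps both quote characters.
default_rows = list(csv.reader(StringIO(sample)))
print(default_rows)  # [['qw""erty', 'a"b"c"d,ef""g']]
```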
Is there any way to parse this with the Python csv module?
Recommended answer
Following up on @JackManey's comment, where he suggested replacing all instances of '""' inside double quotes with '\\"'.
Recognizing whether we are currently inside a double-quoted cell turned out to be unnecessary, and we can simply replace all instances of '""' with '\\"'.
The Python documentation says:
On reading, the escapechar removes any special meaning from the following character
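As a rough illustration of the replacement idea, the sketch below applies it to the example line from the question. It assumes Python 3 and an in-memory string; the variable names are made up for the example.

```python
import csv
from io import StringIO

sample = 'qw""erty,"a""b""c""d,ef""""g"\n'

# Rewrite every doubled quote as backslash + quote, inside or outside quoted cells.
escaped = sample.replace('""', '\\"')

# doublequote=False: "" no longer has a special meaning.
# escapechar='\\': the character after a backslash is taken literally.
reader = csv.reader(StringIO(escaped), doublequote=False, escapechar='\\')
rows = list(reader)
print(rows)  # [['qw"erty', 'a"b"c"d,ef""g']]
```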
However, this would still break in the case where the original cell already contains escape characters, for example 'qw\\\\""erty' producing [['qw\\"erty']]. So we have to escape the escape characters before parsing too.
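The backslash problem can be seen by comparing the two variants side by side. This is a sketch assuming Python 3; `parse` is a hypothetical helper introduced only for this comparison.

```python
import csv
from io import StringIO

def parse(text):
    """Hypothetical helper: parse pre-escaped text with backslash as escapechar."""
    return list(csv.reader(StringIO(text), doublequote=False, escapechar='\\'))

# Raw cell text is: qw\\""erty  (two backslashes, then a doubled quote)
sample = 'qw\\\\""erty\n'

# Only the quotes are rewritten: the reader treats the first original
# backslash as an escape character, so one backslash is lost.
naive_rows = parse(sample.replace('""', '\\"'))

# Backslashes are doubled first, so both of them survive parsing.
fixed_rows = parse(sample.replace('\\', '\\\\').replace('""', '\\"'))

print(naive_rows)  # [['qw\\"erty']]
print(fixed_rows)  # [['qw\\\\"erty']]
```

The order of the two replace calls matters: doubling the backslashes after rewriting the quotes would also double the backslashes that were just inserted in front of the quotes.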
Final solution:
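import csv
from io import StringIO
def parse_csv(file_path):  # illustrative wrapper so that `return` is valid
    with open(file_path, newline='') as f:  # text mode (Python 3)
        # Escape backslashes first, then rewrite each doubled quote as \"
        content = f.read().replace('\\', '\\\\').replace('""', '\\"')
    reader = csv.reader(StringIO(content), doublequote=False, escapechar='\\')
    return list(reader)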
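A possible end-to-end usage, sketched for Python 3: the example row is written to a temporary file and parsed with the same replace-then-parse idea. The function name `read_doubled_quote_csv`, the UTF-8 encoding, and the temporary file are assumptions made for this illustration, not part of the original answer.

```python
import csv
import os
import tempfile
from io import StringIO

def read_doubled_quote_csv(file_path):
    """Hypothetical helper: replace-then-parse for Python 3 text files."""
    # newline='' leaves line endings for the csv module to interpret.
    with open(file_path, newline='', encoding='utf-8') as f:
        content = f.read().replace('\\', '\\\\').replace('""', '\\"')
    return list(csv.reader(StringIO(content), doublequote=False, escapechar='\\'))

# Write the example row to a temporary file, parse it, then clean up.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='', encoding='utf-8') as tmp:
    tmp.write('qw""erty,"a""b""c""d,ef""""g"\n')
try:
    parsed_rows = read_doubled_quote_csv(tmp.name)
finally:
    os.remove(tmp.name)

print(parsed_rows)  # [['qw"erty', 'a"b"c"d,ef""g']]

# Caveat: an empty quoted cell ("") is rewritten as well and comes back
# as a single quote character instead of an empty string.
empty_cell_rows = list(csv.reader(StringIO('a,"",b\n'.replace('""', '\\"')),
                                  doublequote=False, escapechar='\\'))
print(empty_cell_rows)  # [['a', '"', 'b']]
```

If the external system can emit empty quoted cells, the blanket replacement would need an extra rule for them, so it is worth checking a sample of real files first.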