How do I split a comma delimited string in Python except for the commas that are within quotes
Problem description
I am trying to split a comma-delimited string in Python. The tricky part for me is that some of the fields contain a comma themselves, and those fields are enclosed in quotes (" or '). The quotes around such fields should also be removed from the resulting split strings. Also, some fields can be empty.
Example:
hey,hello,,"hello,world",'hey,world'
needs to be split into 5 parts like below
['hey', 'hello', '', 'hello,world', 'hey,world']
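For comparison, here is a minimal sketch of what a plain `str.split(',')` does with the example line. The variable names `line` and `naive` are only illustrative. It shows why the built-in split is not enough here: the quoted fields are cut at their inner commas and the quote characters stay in the output.

```python
# The example line from the question, with one double-quoted
# and one single-quoted field that each contain a comma.
line = 'hey,hello,,"hello,world",\'hey,world\''

# str.split knows nothing about quotes, so it cuts at every comma.
naive = line.split(',')

print(naive)       # ['hey', 'hello', '', '"hello', 'world"', "'hey", "world'"]
print(len(naive))  # 7 pieces instead of the 5 wanted
```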
Any ideas/thoughts/suggestions/help with how to go about solving the above problem in Python would be much appreciated.
Thank You, Vish
Solution
(Edit: The original answer had trouble with empty fields on the edges due to the way re.findall works, so I refactored it a bit and added tests.)

import re

def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
    pos = 0
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)

        yield result

        if not separator:
            break

        pos = m.end(0)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

(['"]?) matches an optional single- or double-quote.
(.*?) matches the string itself. This is a non-greedy match, to match as much as necessary without eating the whole string. This is assigned to result, and it's what we actually yield as a result.
\1 is a backreference, to match the same single- or double-quote we matched earlier (if any).
(,|$) matches the comma separating each entry, or the end of the line. This is assigned to separator.
If separator is false (e.g. empty), that means there's no separator, so we're at the end of the string and we're done. Otherwise, we update the new start position based on where the regex finished (m.end(0)), and continue the loop.
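
To make the roles of the three groups concrete, here is a small sketch that runs the same regex and loop but records every group on each iteration instead of yielding only the field. The helper name `trace` and the shortened sample string are assumptions made for illustration; they are not part of the answer's code.

```python
import re

# Same pattern as in parse_fields above.
exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")

def trace(text):
    """Return (quote, field, separator) for each loop iteration.

    Hypothetical helper for illustration: it mirrors the loop in
    parse_fields but keeps all three groups.
    """
    steps = []
    pos = 0
    while True:
        m = exp.search(text, pos)
        steps.append((m.group(1), m.group(2), m.group(3)))
        if not m.group(3):   # empty separator: end of string reached
            break
        pos = m.end(0)       # resume right after the matched comma
    return steps

text = 'hey,"hello,world",\'hey,world\''
for quote, field, separator in trace(text):
    print(repr(quote), repr(field), repr(separator))
# '' 'hey' ','
# '"' 'hello,world' ','
# "'" 'hey,world' ''
```

The last row has an empty separator, which is the condition that ends the loop.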