Regular expression to match comma separated list of key=value where value can contain commas


Question

I have a naive "parser" that simply does something like:
[x.split('=') for x in mystring.split(',')]

However, mystring can be something like
'foo=bar,breakfast=spam,eggs'

Obviously, the naive splitter will not handle this. I am limited to the Python 2.6 standard library, so pyparsing, for example, cannot be used.

Expected output is
[('foo', 'bar'), ('breakfast', 'spam,eggs')]

I'm trying to do this with a regex, but am facing the following problems:

My first attempt
r'([a-z_]+)=(.+),?'
gave me
[('foo', 'bar,breakfast=spam,eggs')]

Obviously, making .+ non-greedy does not solve the problem.

So I'm guessing I have to somehow make the last comma (or $) mandatory. Doing just that does not really work:
r'([a-z_]+)=(.+?)(?:,|$)'
With that, everything after the first comma in a value that contains one is dropped, e.g.
[('foo', 'bar'), ('breakfast', 'spam')]
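The three attempts described so far can be reproduced side by side. This is only a sketch: the variable names are made up for illustration, and the non-greedy variant is assumed to be the first pattern with `.+` swapped for `.+?`, since the question does not show it.

```python
import re

mystring = 'foo=bar,breakfast=spam,eggs'

# Greedy .+ runs to the end of the string, so only one pair is found.
greedy = re.findall(r'([a-z_]+)=(.+),?', mystring)

# Lazy .+? (assumed form) stops after a single character,
# because the trailing ,? is optional and can match nothing.
lazy = re.findall(r'([a-z_]+)=(.+?),?', mystring)

# Requiring a comma or end of string cuts each value at its first comma.
anchored = re.findall(r'([a-z_]+)=(.+?)(?:,|$)', mystring)

print(greedy)
print(lazy)
print(anchored)
```

None of the three can tell a comma inside a value from a comma that separates two pairs, which is the actual difficulty here.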

I think I must use some sort of look-behind(?) operation.
The question(s):
1. Which one do I use? Or
2. How do I do that?

Edit:

Based on daramarak's answer below, I ended up doing pretty much the same thing as abarnert later suggested, in a slightly more verbose form:

vals = [x.rsplit(',', 1) for x in (data.split('='))]
ret = list()
while vals:
    value = vals.pop()[0]
    key = vals[-1].pop()
    ret.append((key, value))
    if len(vals[-1]) == 0:
        break
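One caveat, as far as I can tell from tracing the loop above as posted: the final chunk is also passed through rsplit, so a last value that contains a comma is cut at that comma, and the pairs come out back to front. With the sample string it would give [('breakfast', 'spam'), ('foo', 'bar')]. A minimal sketch of the same split-on-'=' idea that treats the last chunk as pure value; `parse_pairs` is a hypothetical name, and the input is assumed to be well formed:

```python
def parse_pairs(data):
    """Split 'k=v,k=v' text where values may contain commas (hypothetical helper)."""
    chunks = data.split('=')
    if len(chunks) < 2:
        return []  # no '=' at all, so there are no pairs
    pairs = []
    key = chunks[0]
    # Every middle chunk looks like '<value>,<next key>'; assumes well-formed input.
    for chunk in chunks[1:-1]:
        value, next_key = chunk.rsplit(',', 1)
        pairs.append((key, value))
        key = next_key
    # The final chunk is all value: no key follows it.
    pairs.append((key, chunks[-1]))
    return pairs

print(parse_pairs('foo=bar,breakfast=spam,eggs'))
```

This uses only str.split and str.rsplit, so it should stay within the Python 2.6 standard library constraint.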

Just to satisfy my curiosity, is this actually possible with pure regular expressions? I.e. so that re.findall() would return a list of 2-tuples?

Recommended answer

Just for comparison purposes, here's a regex that seems to solve the problem as well:

([^=]+)    # key
=          # equals is how we tokenise the original string
([^=]+)    # value
(?:,|$)    # value terminator, either comma or end of string
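The annotated form above can be used as written by compiling it with re.VERBOSE, which ignores the layout whitespace and the # comments. A small sketch; `PAIR_RE` and `pairs` are names chosen here for illustration:

```python
import re

# Same pattern as above, kept in its commented layout.
PAIR_RE = re.compile(r"""
    ([^=]+)    # key
    =          # equals is how we tokenise the original string
    ([^=]+)    # value
    (?:,|$)    # value terminator, either comma or end of string
""", re.VERBOSE)

pairs = PAIR_RE.findall('foo=bar,breakfast=spam,eggs')
print(pairs)
```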

The trick here is to restrict what you capture in the second group. .+ swallows the = sign, which is the character we can use to distinguish keys from values. The full regex doesn't rely on backtracking-only features such as lookarounds or backreferences (so it should be compatible with something like re2, if that's desirable) and works on abarnert's examples.

Usage:

re.findall(r'([^=]+)=([^=]+)(?:,|$)', 'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam')

which returns:

[('foo', 'bar'), ('breakfast', 'spam,eggs'), ('blt', 'bacon,lettuce,tomato'), ('spam', 'spam')]
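The question guessed that some kind of look-around would be needed. For comparison, here is a sketch of that route using a lookahead rather than a look-behind. It assumes keys match [a-z_]+ as in the question's own attempts, and `LOOKAHEAD_RE` and `result` are illustrative names. Unlike the pattern above it depends on lookahead, which re2 does not support.

```python
import re

# A value ends where the next ',key=' begins, or at the end of the string.
# The lookahead checks for that boundary without consuming it.
LOOKAHEAD_RE = re.compile(r'([a-z_]+)=(.*?)(?=,[a-z_]+=|$)')

result = LOOKAHEAD_RE.findall(
    'foo=bar,breakfast=spam,eggs,blt=bacon,lettuce,tomato,spam=spam')
print(result)
```

On this input it should produce the same list of pairs as the [^=]+ pattern. The difference is in how a value is bounded: here by the shape of the next key, there by the ban on = inside keys and values.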
