Splitting comma delimited strings in Python

Problem description

This question has been asked and answered many times before. Some examples: [1], [2]. But there doesn't seem to be anything more general. What I'm looking for is a way to split strings at commas that are not within quotes or pairs of delimiters. For instance:

s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'

should be split into a list of three elements:

['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']

This gets more complicated because pairs of <> and () can be nested:

s2 = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5), "msg, with comma"'

which should be split into:

['obj<1, sub<6, 7>, 3>', 'x(4, y(8, 9), 5)', '"msg, with comma"']

The naive solution without regex is to scan the string for the characters , < ( > and ) while keeping a counter, which I'll call the parity (it is really a nesting depth). We start with parity = 0. The parity increases by 1 each time we encounter < or ( and decreases by 1 each time we encounter > or ). We may split at a comma only when the parity is zero; while it is not 0 we simply ignore the commas. For instance, when splitting s2 we reach s2[3] and encounter <, which raises the parity to 1.
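The counting described above can be sketched roughly as follows. This is only an illustration of the idea: the helper name top_level_commas is made up here, quotes are ignored for brevity, and the example string is s2 without its quoted part. It returns the positions of the commas seen while the parity is zero.

```python
def top_level_commas(text, opens="<(", closes=">)"):
    """Return the indices of commas that lie outside all bracket pairs."""
    parity = 0       # nesting depth: +1 per opener, -1 per closer
    positions = []
    for index, char in enumerate(text):
        if char in opens:
            parity += 1
        elif char in closes:
            parity -= 1
        elif char == "," and parity == 0:
            positions.append(index)
    return positions


# s2 from the question, without the quoted part (quotes are not handled here).
s2_brackets_only = 'obj<1, sub<6, 7>, 3>, x(4, y(8, 9), 5)'
print(top_level_commas(s2_brackets_only))  # [20]
```

Only the comma at index 20 separates top-level elements; every other comma is seen while the parity is above zero.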

The question here is: is there a way to do this quickly with regex? I was really looking into this solution, but it doesn't seem to cover the examples I have given.

A more general function would be something like this:

def split_at(text, delimiter, exceptions):
    """Split text at the specified delimiter if the delimiter is not
    within the exceptions"""

Some uses would be like this:

split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',', [('<', '>'), ('(', ')'), ('"', '"')])

Would regex be able to handle this or is it necessary to create a specialized parser?

Solution

Python's standard re module has no recursive patterns, so a regular expression cannot keep track of arbitrarily nested delimiters. The following simple code will achieve the desired result:

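def split_at(text, delimiter, opens='<([', closes='>)]', quotes='"\''):
    """Split text at delimiter, except inside brackets or quotes."""
    result = []
    buff = ""
    level = 0          # current bracket nesting depth
    is_quoted = False  # True while inside a quoted section

    for char in text:
        if char in delimiter and level == 0 and not is_quoted:
            # Top-level delimiter: finish the current piece.
            result.append(buff)
            buff = ""
        else:
            buff += char

            if char in quotes:
                is_quoted = not is_quoted
            elif not is_quoted:
                # Count brackets only outside quotes, so that a bracket
                # inside a quoted string cannot unbalance the level.
                if char in opens:
                    level += 1
                elif char in closes:
                    level -= 1

    # Keep whatever follows the last delimiter, if anything.
    if buff:
        result.append(buff)

    return result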

Running this in the interpreter:

>>> split_at('obj<1, 2, 3>, x(4, 5), "msg, with comma"', ',')
['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
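Note that the output above keeps the space after each comma, and that this function takes separate opens/closes/quotes strings rather than the list of pairs proposed in the question. A possible variant with the question's signature is sketched below. It is an assumption-laden sketch, not part of the original answer: the name split_at_pairs is made up, a pair whose opener equals its closer is treated as a quote, and each piece is stripped of surrounding whitespace to match the lists the question expects.

```python
def split_at_pairs(text, delimiter, exceptions):
    """Split text at delimiter unless it lies inside one of the pairs.

    exceptions is a list of (opener, closer) tuples. A pair whose opener
    equals its closer is treated as a quote: nothing nests inside it.
    """
    closer_for = dict(exceptions)  # opener -> closer
    quote_chars = {o for o, c in exceptions if o == c}
    stack = []   # closers still expected, innermost last
    parts = []
    buff = []

    for char in text:
        in_quote = bool(stack) and stack[-1] in quote_chars
        if stack and char == stack[-1]:
            stack.pop()                      # innermost pair is closed
        elif not in_quote and char in closer_for:
            stack.append(closer_for[char])   # a new pair is opened
        elif char == delimiter and not stack:
            parts.append("".join(buff).strip())
            buff = []
            continue                         # drop the delimiter itself
        buff.append(char)

    tail = "".join(buff).strip()
    if tail:
        parts.append(tail)
    return parts


pairs = [('<', '>'), ('(', ')'), ('"', '"')]
s1 = 'obj<1, 2, 3>, x(4, 5), "msg, with comma"'
print(split_at_pairs(s1, ',', pairs))
# ['obj<1, 2, 3>', 'x(4, 5)', '"msg, with comma"']
```

Using a stack of expected closers instead of a single counter means a > cannot close a ( by accident. Closers that appear without a matching opener are simply kept as ordinary characters.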
