Parsing a string in Python: how to split on newlines while ignoring newlines inside quotes


Problem description

I have a text that I need to parse in Python.

It is a string that I would like to split into a list of lines. However, if a newline (\n) is inside quotes, it should be ignored.

For example:

abcd efgh ijk\n1234 567"qqqq\n---" 890\n

should be parsed into a list of the following lines:

abcd efgh ijk
1234 567"qqqq\n---" 890

I've tried to do it with split('\n'), but I don't know how to ignore the quotes.

Any ideas?

Thanks!

Recommended answer

Here's a much easier solution.

Match groups of (?:"[^"]*"|.)+. Namely, "things in quotes or things that aren't newlines". Each match is one logical line: a run of quoted sections (which may contain newlines) and ordinary non-newline characters.

Example:

import re
re.findall('(?:"[^"]*"|.)+', text)
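As a minimal sketch, the same pattern can be applied to the sample string from the question. The helper name split_lines and the constant LINE_PATTERN are made up for illustration; the pattern itself is the one above, written in verbose mode only so each part can carry a comment.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE for readability.
# LINE_PATTERN and split_lines are hypothetical names, not part of any library.
LINE_PATTERN = re.compile(r'''
    (?:
        "[^"]*"   # a double-quoted run, which may span newlines
      | .         # or any single character except a newline
    )+            # repeated one or more times
''', re.VERBOSE)


def split_lines(text):
    """Return the lines of text, keeping newlines that sit inside double quotes."""
    return LINE_PATTERN.findall(text)


sample = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
lines = split_lines(sample)
for line in lines:
    print(repr(line))
# 'abcd efgh ijk'
# '1234 567"qqqq\n---" 890'
```

Because the pattern contains no capturing group, findall returns the full text of each match, so every list element is one logical line.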


NOTE: This coalesces several consecutive newlines into one, because blank lines produce no match and are therefore ignored. To avoid that, give a null case as well: (?:"[^"]*"|.)+|(?!\Z).

The (?!\Z) is a confusing way to say "not at the end of the string". The (?! ) is a negative lookahead; the \Z is the "end of the string" part.
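Empty matches from findall can be hard to reason about (the empty alternative may also match at the newline that ends a non-blank line), so when blank lines must be kept it may be simpler to scan the string by hand. The following is an alternative sketch, not the regex approach from this answer. The function name split_outside_quotes is hypothetical, and it assumes that quotes are plain double quotes, that they are balanced, and that there are no escaped quotes.

```python
def split_outside_quotes(text):
    """Split text on newlines outside double quotes, keeping blank lines.

    Assumes balanced double quotes and no escaped quotes (an assumption of
    this sketch, not something stated in the original answer).
    """
    lines = []
    current = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes       # toggle on every double quote
            current.append(ch)
        elif ch == '\n' and not in_quotes:
            lines.append(''.join(current))  # end of a logical line
            current = []
        else:
            current.append(ch)              # includes newlines inside quotes
    if current:
        lines.append(''.join(current))      # trailing text without a final newline
    return lines


print(split_outside_quotes('abcd efgh ijk\n1234 567"qqqq\n---" 890\n'))
print(split_outside_quotes('a\n\nb'))
```

Unlike the regex, this version treats everything after an unmatched quote as quoted, so unbalanced input will behave differently between the two approaches.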

Test:

import re

texts = (
    'text',
    '"text"',
    'text\ntext',
    '"text\ntext"',
    'text"text\ntext"text',
    'text"text\n"\ntext"text"',
    '"\n"\ntext"text"',
    '"\n"\n"\n"\n\n\n""\n"\n"'
)

line_matcher = re.compile('(?:"[^"]*"|.)+')

for text in texts:
    print("{:>27} → {}".format(
        text.replace("\n", "\\n"),
        " [LINE] ".join(line_matcher.findall(text)).replace("\n", "\\n")
    ))

#>>>                        text → text
#>>>                      "text" → "text"
#>>>                  text\ntext → text [LINE] text
#>>>                "text\ntext" → "text\ntext"
#>>>        text"text\ntext"text → text"text\ntext"text
#>>>    text"text\n"\ntext"text" → text"text\n" [LINE] text"text"
#>>>            "\n"\ntext"text" → "\n" [LINE] text"text"
#>>>    "\n"\n"\n"\n\n\n""\n"\n" → "\n" [LINE] "\n" [LINE] "" [LINE] "\n"
