Python regular expression to split paragraphs


Question

How would one write a regular expression in Python to split text into paragraphs?

A paragraph break is defined by 2 line breaks (\n). Any amount of spaces/tabs may appear together with the line breaks, and it should still be treated as a paragraph break.

I am using Python, so the solution can use Python's extended regular expression syntax (it can make use of the (?P...) constructs).

the_str = 'paragraph1\n\nparagraph2'
# splitting should yield ['paragraph1', 'paragraph2']

the_str = 'p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'
# should yield ['p1', 'p2\t\n\tstill p2', 'p3']

the_str = 'p1\n\n\n\tp2'
# should yield ['p1', '\n\tp2']

The best I could come up with is r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', i.e.

import re
paragraphs = re.split(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*', the_str)

but that is ugly. Is there anything better?
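As a quick sanity check, the pattern from the question can be run against the three examples. This is only a sketch: the helper name `split_paragraphs` and the `examples` list are illustrative and not part of the original post.

```python
import re

# Separator from the question: exactly two newlines, each optionally
# surrounded by spaces, tabs, \r, \f or \v, but never a third newline.
PARAGRAPH_BREAK = re.compile(r'[ \t\r\f\v]*\n[ \t\r\f\v]*\n[ \t\r\f\v]*')


def split_paragraphs(text):
    """Split text on paragraph breaks (hypothetical helper name)."""
    return PARAGRAPH_BREAK.split(text)


# (input, expected result) pairs taken from the question.
examples = [
    ('paragraph1\n\nparagraph2', ['paragraph1', 'paragraph2']),
    ('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3',
     ['p1', 'p2\t\n\tstill p2', 'p3']),
    ('p1\n\n\n\tp2', ['p1', '\n\tp2']),
]

for text, expected in examples:
    result = split_paragraphs(text)
    print(result == expected, result)
```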

Edit:

r'\s*?\n\s*?\n\s*?' -> That would make examples 2 and 3 fail, since \s includes \n, so it would allow paragraph breaks with more than 2 \n characters.
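The underlying point, that \s also matches \n, can be seen in a small sketch. Note the assumption: the split below uses a greedy variant of the pattern (plain `*` rather than `*?`), chosen only to make the effect easy to see; the variable name `greedy_result` is illustrative.

```python
import re

# \s is a superset of [ \t\r\f\v]: it also matches the newline itself.
print(re.fullmatch(r'\s', '\n') is not None)

# Greedy variant (an assumption for illustration) applied to example 3:
# the separator absorbs the third newline and the tab as well.
greedy_result = re.split(r'\s*\n\s*\n\s*', 'p1\n\n\n\tp2')
print(greedy_result)  # ['p1', 'p2'] rather than ['p1', '\n\tp2']
```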

Recommended answer

Unfortunately there's no nice way to write "space but not a newline".

I think the best you can do is add some space with the x (verbose) modifier and try to factor out the ugliness a bit, but that's questionable: (?x) (?: [ \t\r\f\v]*? \n ){2} [ \t\r\f\v]*?
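A minimal sketch of the verbose-mode idea, using `re.VERBOSE` instead of the inline `(?x)` flag. One deliberate change from the pattern quoted above: the quantifiers here are greedy, because a lazy `*?` at the very end of a pattern matches the empty string and would leave trailing spaces or tabs at the start of the next paragraph. The name `PARAGRAPH_BREAK_X` is illustrative.

```python
import re

# In verbose mode, whitespace in the pattern is ignored except inside a
# character class, so the literal space in [ \t\r\f\v] still counts.
PARAGRAPH_BREAK_X = re.compile(r'''
    (?: [ \t\r\f\v]* \n ){2}   # two newlines, each preceded by optional blanks
    [ \t\r\f\v]*               # optional blanks after the second newline
''', re.VERBOSE)

print(PARAGRAPH_BREAK_X.split('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
print(PARAGRAPH_BREAK_X.split('p1\n\n\n\tp2'))
```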

You could also try creating a subrule just for the character class and interpolating it three times.
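The subrule suggestion might look like the following sketch, assuming Python 3.6+ for f-strings. The names `BLANK` and `PARAGRAPH_BREAK_SUB` are illustrative, not from the answer.

```python
import re

# Subrule: a run of whitespace characters that excludes the newline.
BLANK = r'[ \t\r\f\v]*'

# Interpolate the subrule three times around the two newlines.
# The raw f-string keeps \n as a regex escape rather than a literal newline.
PARAGRAPH_BREAK_SUB = re.compile(rf'{BLANK}\n{BLANK}\n{BLANK}')

print(PARAGRAPH_BREAK_SUB.split('p1\n\t\np2\t\n\tstill p2\t   \n     \n\tp3'))
```

This keeps the pattern equivalent to the one in the question while making the repeated character class easier to read and change in one place.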
