Split text on paragraphs where paragraph delimiters are non-standard


Question

If I have text with standard paragraph formatting (a blank line followed by an indent), such as text 1, it's easy enough to extract the paragraphs using text.split("\n\n").

Text 1:

      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus sit amet sapien velit, ac sodales   
 ante. Integer mattis eros non turpis interdum et auctor enim consectetur, etc.

      Praesent molestie suscipit bibendum. Donec justo purus, venenatis eget convallis sed, feugiat    
 vitae velit,etc.

But what if I have text with non-standard paragraph formatting, such as text 2, with no blank lines and variable leading whitespace?

Text 2:

      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus sit amet sapien velit, ac sodales   
 ante. Integer mattis eros non turpis interdum et auctor enim consectetur, etc.
    Praesent molestie suscipit bibendum. Donec justo purus, venenatis eget convallis sed, feugiat    
 vitae velit,etc.

Since leading whitespace is common to both the standard and non-standard formats, I've thought about indexing on the regex matches for leading whitespace and getting the paragraph breaks that way, but there has to be a more elegant way to do this.
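For reference, the indexing idea described above could be sketched roughly as follows. The function names and the `min_indent` threshold of 3 are assumptions made for illustration; they are not part of the original question.

```python
import re

def paragraph_starts(text, min_indent=3):
    """Return offsets of lines that begin with at least `min_indent` spaces or tabs.

    `min_indent` is a hypothetical threshold; tune it to the text at hand.
    """
    pattern = re.compile(r"^[ \t]{%d,}(?=\S)" % min_indent, re.MULTILINE)
    return [m.start() for m in pattern.finditer(text)]

def slice_paragraphs(text, min_indent=3):
    """Slice the text between consecutive indented line starts."""
    starts = paragraph_starts(text, min_indent)
    ends = starts[1:] + [len(text)]
    # Anything before the first indented line is ignored in this sketch.
    return [text[s:e].strip() for s, e in zip(starts, ends)]
```

This works, but it needs two passes (find offsets, then slice), which is presumably why a single split call feels more elegant.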

Recommended answer

The regex solution you propose seems elegant enough:

import re
re.split(r'\s{4,}', text)

This uses four or more consecutive whitespace characters as the paragraph delimiter. You can use '\n\s{3,}' or something similar if it fits better.
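As a rough illustration of the '\n\s{3,}' variant, here is a minimal sketch. The helper name `split_paragraphs`, the `min_indent` parameter, and the shortened sample text are assumptions for this example. One thing to watch for: the sample texts have trailing spaces before the wrapped line, so a bare `\s{4,}` may also match across a line wrap inside a paragraph; anchoring the pattern on the newline avoids that here.

```python
import re

# Shortened sample modelled on text 2: no blank lines, variable indentation,
# and trailing spaces before each wrapped line.
text = (
    "      Lorem ipsum dolor sit amet, ac sodales   \n"
    " ante. Integer mattis eros non turpis interdum.\n"
    "    Praesent molestie suscipit bibendum, feugiat    \n"
    " vitae velit,etc.\n"
)

def split_paragraphs(text, min_indent=3):
    """Split on a newline followed by at least `min_indent` whitespace characters."""
    # Prepend a newline so an indented first paragraph is treated like the others.
    pieces = re.split(r"\n\s{%d,}" % min_indent, "\n" + text)
    # Join wrapped lines within each paragraph and drop empty pieces.
    return [" ".join(p.split()) for p in pieces if p.strip()]

paragraphs = split_paragraphs(text)
```

Because `\s` also matches newlines, the same pattern should handle the standard format of text 1, where the delimiter is a blank line followed by an indent.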
