How to split text into sentences when there is no space after full stop?

Views: 149 Posted: 2020/5/18 1:08:30 Tags: python regex nlp nltk


Problem description

I have a text like

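'A gas well near Surabaya in East Java operated by Lapindo Brantas Inc. has been spewing steaming mud since May last year, submerging villages, industries and fields.Since May last year, the company has been spewing steaming mud, submerging villages, factories and fields.Last week, Indonesia's Coordinating Minister for Social Welfare, Aburizal Bakrie, whose family firm controls Lapindo Brantas, said the volcano was a natural disaster unrelated to drilling activities".President Susilo Bambang Yudhoyono last month ordered Lapindo to pay 3.8 trillion rupiah ($420.7 million) in compensation and costs.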

I want to split it into sentences. NLTK and every standard regex I have found online fail on it.

Recommended answer

You can use a regex positive lookahead to add a space after sentence-ending periods, then pass the result to the tool of your choice. The substitution below adds a space after any period that is directly followed by a letter (or an underscore), and leaves alone periods followed by a space, a digit (as in 1.1), or punctuation such as a comma. Because it relies on the character classes \W and \d instead of, say, A-Z, it is not limited to English letters: in Python 3 these classes are Unicode-aware for str patterns.

>>> import re
>>> re.sub(r'\.(?=[^ \W\d])', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _'
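
To make the behaviour of the character class easier to see, here is a minimal sketch that wraps the same substitution in a helper and runs it on a non-English string as well. The names `MISSING_SPACE` and `add_missing_spaces` and the Russian sample sentence are illustrative assumptions, not part of the original answer.

```python
import re

# A period followed directly by a word character that is not a digit,
# i.e. a letter or an underscore. The lookahead does not consume that character.
MISSING_SPACE = re.compile(r'\.(?=[^ \W\d])')

def add_missing_spaces(text):
    """Insert a space after each period that is glued to the next word."""
    return MISSING_SPACE.sub('. ', text)

# The example from the answer: "1.1" and "Inc.," are left untouched.
print(add_missing_spaces('Foo bar.Baz Inc., foobar. 1.1, and abc._'))

# str patterns are Unicode-aware in Python 3, so Cyrillic letters count as
# word characters too; the decimal "3.8" is still skipped.
print(add_missing_spaces('Это тест.Второе предложение стоит 3.8 рубля.'))
```

Since the lookahead only inspects the next character, the final period of a string is never changed: there is nothing after it to match.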

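You can also leave some URLs alone by adding a second lookahead that searches for slashes. The two lookaheads must not contradict each other: a positive lookahead such as (?=[^\w*]/) demands a non-word character right after the period, which the first lookahead forbids, so the combined pattern never matches and nothing is split. A negative lookahead works instead: skip the period when the run of non-space characters that follows it contains a slash.

>>> re.sub(r'\.(?=[^ \W\d])(?!\S*/)', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._ http://www.example.com/whatever')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _ http://www.example.com/whatever'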
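
The answer suggests normalizing the text first and then handing it to a sentence splitter. The sketch below shows that two-step flow under a few assumptions: the negative lookahead `(?!\S*/)` is one possible way to leave URL-like tokens alone, `naive_split` is a hypothetical stand-in for a real tool such as NLTK's `sent_tokenize`, and the sample text is made up.

```python
import re

# Period followed directly by a letter or underscore, unless the run of
# non-space characters after it contains a slash (a rough URL heuristic).
MISSING_SPACE = re.compile(r'\.(?=[^ \W\d])(?!\S*/)')

def normalize_periods(text):
    """Step 1: add the missing space after sentence-ending periods."""
    return MISSING_SPACE.sub('. ', text)

def naive_split(text):
    """Step 2 (stand-in splitter): break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

text = ('It rained all day.The match was cancelled.'
        'See http://www.example.com/news for details.')

for sentence in naive_split(normalize_periods(text)):
    print(sentence)
```

The stand-in splitter is deliberately simple and would break after abbreviations such as "Inc."; a trained tokenizer is the better choice for step 2. The heuristic also only protects URLs whose remaining part contains a slash, so a bare host name like www.example.com would still be split.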
