How to split text into sentences when there is no space after a full stop?
Problem description
I have a text like:
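'A gas well near Surabaya in East Java operated by Lapindo Brantas Inc. has been spewing steaming mud since May last year, submerging villages, industries and fields.Since May last year it has been spewing steaming mud, submerging villages, factories and fields.Last week, Indonesia's Coordinating Minister for Social Welfare Aburizal Bakrie, whose family firm controls Lapindo Brantas, said the volcano was a natural disaster unrelated to the drilling activities".President Susilo Bambang Yudhoyono last month ordered Lapindo to pay 3.8 trillion rupiah ($420.7 million) in compensation and costs.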
I want to split it into sentences. NLTK and every standard regex I have found online fail on it.
Recommended answer
You can use a regex positive lookahead to add a space after sentence-ending periods and then pass the result to the tool of your choice. The pattern below adds a space after any period that is immediately followed by a letter (or underscore), and leaves alone periods followed by a space, a digit, or a non-alphanumeric character such as a comma. Because it relies on character classes rather than, say, A-Z, it works for any language.
>>> import re
>>> re.sub(r'\.(?=[^ \W\d])', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _'
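To illustrate the "pass it to the tool of your choice" step, here is a minimal sketch that applies the substitution above and then hands the result to a deliberately naive splitter. The names `add_missing_spaces` and `naive_split` are made up for this example, and the splitter is only a stand-in for a real tool such as NLTK; it would still break after abbreviations like "Inc." when a space follows them.

```python
import re

# A period directly followed by a letter or underscore
# (not a space, not punctuation, not a digit).
MISSING_SPACE = re.compile(r'\.(?=[^ \W\d])')

def add_missing_spaces(text):
    """Insert a space after periods that are immediately followed by a letter."""
    return MISSING_SPACE.sub('. ', text)

def naive_split(text):
    """Stand-in splitter: break on whitespace that follows '.', '!' or '?'."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

text = 'It spewed mud.Villages were submerged.Costs reached 3.8 trillion rupiah.'
sentences = naive_split(add_missing_spaces(text))
print(sentences)
# ['It spewed mud.', 'Villages were submerged.', 'Costs reached 3.8 trillion rupiah.']

# In Python 3, \w and \d are Unicode-aware for str patterns,
# so letters outside A-Z are handled the same way.
print(add_missing_spaces('Привет.Как дела?'))
# Привет. Как дела?
```

Note that the decimal in "3.8" is left untouched because the character after the period is a digit.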
You can keep some URLs intact by adding a second, negative lookahead that skips a period when the word characters and dots after it run straight into a slash
>>> re.sub(r'\.(?=[^ \W\d])(?![\w.]*/)', '. ', 'Foo bar.Baz Inc., foobar. 1.1, and abc._ http://www.example.com/whatever')
'Foo bar. Baz Inc., foobar. 1.1, and abc. _ http://www.example.com/whatever'
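As a rough sketch of the URL guard, the function below wraps a pattern with a negative lookahead, `(?![\w.]*/)`, so that a period is left alone when the run of word characters and dots after it ends in a slash. This is one assumed way to realize the slash check, not a complete URL detector, and the name `add_missing_spaces_skip_urls` is hypothetical.

```python
import re

# First lookahead: the next character is a letter or underscore.
# Second (negative) lookahead: the following run of word characters and dots
# must not lead straight into a slash, which is typical of host/path URLs.
URL_SAFE = re.compile(r'\.(?=[^ \W\d])(?![\w.]*/)')

def add_missing_spaces_skip_urls(text):
    """Add a space after glued periods, leaving host/path style URLs alone."""
    return URL_SAFE.sub('. ', text)

with_path = add_missing_spaces_skip_urls('See http://www.example.com/docs.Then read on.')
print(with_path)
# See http://www.example.com/docs. Then read on.

# Limitation: a host with no slash after it is not recognised as a URL.
without_path = add_missing_spaces_skip_urls('Visit www.example.com today.')
print(without_path)
# Visit www. example. com today.
```

The second call shows why this only catches some URLs: without a trailing path there is no slash for the lookahead to find.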