Split string into sentences - ignoring abbreviations for splitting

Problem description

I'm trying to split this string into sentences, but I need to handle abbreviations (which have the fixed format x.y. and should be treated as a single word):

content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool."

I tried this regex:

content.replace(/([.?!])\s+(?=[A-Za-z])/g, "$1|").split("|");

But as you can see, there are problems with abbreviations. Since all the abbreviations have the format x.y., it should be possible to treat each one as a word and not split the string at that point. The regex above currently gives me:

"This is a long string with some numbers 123.456,78 or 100.000 and e.g.", 
"some abbreviations in it, which shouldn't split the sentence.",
"Sometimes there are problems, i.e.", 
"in this one.", 
"here and abbr at the end x.y..",
"cool."

The result should be:

"This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence.",
"Sometimes there are problems, i.e. in this one.", 
"here and abbr at the end x.y..",
"cool."

Recommended answer

The solution is to match and capture the abbreviations and build the replacement using a callback:

// Group 1 captures an x.y. abbreviation; group 2 captures sentence-ending punctuation.
var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g;
var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
var result = str.replace(re, function(m, g1, g2){
  // Keep abbreviations unchanged; mark real sentence ends with "\r".
  return g1 ? g1 : g2+"\r";
});
var arr = result.split("\r");
// Display the resulting array (browser only).
document.body.innerHTML = "<pre>" + JSON.stringify(arr, null, 4) + "</pre>";

Regex explanation:

  • \b(\w\.\w\.) - match and capture into Group 1 the abbreviation (a word character, then a ., then another word character and a .), starting at a word boundary
  • | - or...
  • ([.?!])\s+(?=[A-Za-z]):
    • ([.?!]) - match and capture into Group 2 either . or ? or !
    • \s+ - match 1 or more whitespace symbols...
    • (?=[A-Za-z]) - that are before an ASCII letter.
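The two alternatives described above can be wrapped in a small reusable function. This is only a sketch: the name `splitSentences` and the sample text are made up for illustration, and it assumes the input contains no carriage returns, since "\r" serves as the temporary separator.

```javascript
// Hypothetical helper; the regex itself is the one from the answer.
function splitSentences(text) {
  // Alternative 1 (group 1): an abbreviation of the form x.y.
  // Alternative 2 (group 2): . ? or ! followed by whitespace and an ASCII letter.
  const re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g;
  // Abbreviations are put back unchanged; real sentence ends get a "\r" marker
  // in place of the whitespace that followed them.
  const marked = text.replace(re, (match, abbr, punct) =>
    abbr ? abbr : punct + "\r"
  );
  return marked.split("\r");
}

const sample = "Use a regex, e.g. this one. It keeps i.e. intact! Does it work? yes.";
console.log(splitSentences(sample));
// [ 'Use a regex, e.g. this one.', 'It keeps i.e. intact!', 'Does it work?', 'yes.' ]
```

Two limitations follow from the pattern. Because the abbreviation branch consumes the final dot of x.y., a sentence that ends with a bare abbreviation (for example "... x.y. Next") is not split there; the sample string in the question works because it ends with "x.y.." and the second dot triggers the split. The lookahead also only accepts ASCII letters, so a sentence that starts with a digit, a quote, or a non-ASCII letter will not start a new array element.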
