Split string into sentences in JavaScript

Problem description

Currently I am working on an application that splits a long column into short ones. For that I split the entire text into words, but at the moment my regex splits numbers too.

This is what I do:

str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";
sentences = str.replace(/\.+/g,'.|').replace(/\?/g,'?|').replace(/\!/g,'!|').split("|");

The result is:

Array [
    "This is a long string with some numbers [125.",
    "000,55 and 140.",
    "000] and an end.",
    " This is another sentence."
]

The desired result would be:

Array [
    "This is a long string with some numbers [125.000,55 and 140.000] and an end.",
    "This is another sentence."
]

How do I have to change my regex to achieve this? Do I need to watch out for problems I could run into? Or would it be good enough to search for ". ", "? " and "! "?

Recommended answer

str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")

Output:

[ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
  'This is another sentence.' ]

Breakdown:

([.?!]) = Capture one of . or ? or !

\s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that follow a punctuation mark in written English. Because the whitespace sits outside the capturing group, the replacement drops it.

(?=[A-Z]) = A lookahead: the previous tokens only match if the next character is within the range A-Z (capital A to capital Z). The lookahead does not consume that letter, so it stays in the following sentence. Most English sentences start with a capital letter. None of the previous regexes take this into account.
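To make the effect of the lookahead concrete, here is a small sketch that runs the answer's pattern against two fragments of the question's string. The variable names are illustrative only and do not come from the original answer.

```javascript
// The pattern from the answer, reused here only for illustration.
const boundary = /([.?!])\s*(?=[A-Z])/g;

// A period inside a number is followed by a digit, so the lookahead fails.
const inNumber = "125.000".match(boundary);

// A period followed by a space and a capital letter counts as a boundary.
// The matched text is the period plus the space; the "T" is not consumed.
const atSentenceEnd = "an end. This".match(boundary);

console.log(inNumber);      // null
console.log(atSentenceEnd); // [ '. ' ]
```

This is why "125.000" and "140.000" stay in one piece while "end. This" is split.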

The replace operation uses:

"$1|"

We used one capturing group, ([.?!]), which captures one of those characters. The match is replaced with $1 (the captured character) plus |. So if we captured ? then the replacement would be ?|.
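A minimal sketch of the replacement step on its own, using a made-up sample string (not from the original question), shows what the text looks like before it is split:

```javascript
// Sample input chosen for illustration; it is not part of the original post.
const sample = "Really? Yes! Done.";

// $1 re-inserts the captured punctuation mark; the trailing whitespace
// matched by \s* is not captured, so it is replaced away.
const marked = sample.replace(/([.?!])\s*(?=[A-Z])/g, "$1|");

console.log(marked); // Really?|Yes!|Done.
```

The final period is left alone because no capital letter follows it.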

Finally, we split on the pipe character | and get our result.

So, essentially, what we are saying is this:

1) Find punctuation marks (one of . or ? or !) and capture them.

2) Punctuation marks can optionally be followed by spaces.

3) After a punctuation mark, I expect a capital letter.

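Unlike the previously provided regular expressions, this follows the usual English convention that a sentence ends with a punctuation mark and the next one starts with a capital letter. It is still a heuristic: an abbreviation followed by a capitalized word, such as "Mr. Smith", would also be split.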

From there:

4) We replace each match with the captured punctuation mark followed by a pipe |.

5) We split on the pipes to create an array of sentences.
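The five steps above can be wrapped in a small helper. The sketch below is one possible packaging, not part of the original answer: the function names splitSentences and splitSentencesNoSentinel are hypothetical. The second variant is an assumption-labeled alternative that splits directly with a lookbehind (ES2018), which avoids the pipe placeholder and therefore does not break when the input itself contains a | character.

```javascript
// Hypothetical helper: the answer's replace-then-split approach.
// Assumes the input text contains no "|" characters of its own.
function splitSentences(text) {
  return text
    .replace(/([.?!])\s*(?=[A-Z])/g, "$1|") // steps 1-4: mark boundaries
    .split("|");                            // step 5: split on the marker
}

// Hypothetical alternative (not from the answer): split on the whitespace
// between a punctuation mark and a capital letter, with no placeholder.
// Requires an engine that supports lookbehind assertions.
function splitSentencesNoSentinel(text) {
  return text.split(/(?<=[.?!])\s*(?=[A-Z])/);
}

const str = "This is a long string with some numbers [125.000,55 and 140.000] and an end. This is another sentence.";

const sentences = splitSentences(str);
const sentencesAlt = splitSentencesNoSentinel(str);

console.log(sentences);
console.log(sentencesAlt);
```

Both calls should yield the two sentences shown in the answer's output. Neither variant handles abbreviations or sentences that start with a digit or a quotation mark.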
