Split character vector into sentences
Question
I have the following character vector:
"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
I want to split it into sentences using the following pattern (i.e. period, space, uppercase letter):
"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"
Hence, a period after an abbreviation should not start a new sentence. I want to do this using regular expressions in R.
Can someone help me?
Recommended answer
A solution using strsplit:
string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))
Result:
[1] "This is a very long character vector."
[2] "Why is it so long?"
[3] "I think lng. is short for long."
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"
[6] "That would be nice?"
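The pattern matches a whitespace character that is preceded by any punctuation character and followed by an uppercase letter. Only the whitespace is treated as the delimiter: the lookbehind (?<=[[:punct:]]) checks for the punctuation without consuming it, so it stays at the end of the preceding piece, and the lookahead (?=[A-Z]) checks for the uppercase letter without consuming it, so it stays at the start of the following piece. Abbreviations such as "lng." and "e.g." are not split here because the word after them starts with a lowercase letter.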
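To make the role of the lookarounds concrete, here is a small sketch comparing the pattern with and without them. The sample string `x` is made up for illustration; only base R's `strsplit` is used.

```r
x <- "First one. Second one? Third one."

# Without lookarounds the punctuation and the capital letter are part of
# the delimiter, so strsplit() drops them from the pieces.
consumed <- unlist(strsplit(x, "[[:punct:]]\\s[A-Z]", perl = TRUE))

# With lookarounds only the whitespace is the delimiter; the punctuation
# and the capital letter stay in the neighbouring pieces.
kept <- unlist(strsplit(x, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

print(consumed)
print(kept)
```

The first call returns pieces with the sentence-ending punctuation and leading capital stripped, which is usually not what is wanted when splitting into sentences.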
I just noticed that your desired output does not split after a question mark. If you only want to split after a ".", you can use this:
unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))
which gives:
[1] "This is a very long character vector."
[2] "Why is it so long? I think lng. is short for long."
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"
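If both variants are needed, the two patterns can be folded into one function. The sketch below is only an illustration: `split_sentences` and its `terminators` argument are hypothetical names, not part of base R, and it assumes `terminators` contains plain characters such as `.`, `?` or `!` that need no escaping inside a regex character class.

```r
# Hypothetical helper: split each element of `text` on whitespace that
# follows one of `terminators` and precedes an uppercase letter.
# Assumes `terminators` holds only characters that are literal inside [...].
split_sentences <- function(text, terminators = ".?!") {
  pattern <- sprintf("(?<=[%s])\\s(?=[A-Z])", terminators)
  strsplit(text, pattern, perl = TRUE)
}

text <- c("It works. Does it? Yes.", "One e.g. two. Three.")

all_marks   <- split_sentences(text)       # split after ., ? or !
period_only <- split_sentences(text, ".")  # split after . only

print(all_marks)
print(period_only)
```

Like `strsplit` itself, the function returns a list with one character vector per input element, so wrap the call in `unlist()` when a single flat vector is wanted. A capitalised word after an abbreviation (for example a name following a title) would still be split, since the pattern relies only on the case of the next letter.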