Split character vector into sentences


Question

I have the following character vector:

"This is a very long character vector. Why is it so long? I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"

I want to split it into sentences using the following pattern (i.e. period, space, uppercase letter):

"This is a very long character vector."
"Why is it so long? I want to split this vector into senteces by using e.g. strssplit."
"Can someone help me?"
"That would be nice?"

Hence, a period after an abbreviation should not start a new sentence. I want to do this using regular expressions in R.

Can someone help me?

Recommended answer

A solution using strsplit:

string <- "This is a very long character vector. Why is it so long? I think lng. is short for long. I want to split this vector into senteces by using e.g. strssplit. Can someone help me? That would be nice?"
unlist(strsplit(string, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Result:

[1] "This is a very long character vector."                             
[2] "Why is it so long?"                                                
[3] "I think lng. is short for long."                                   
[4] "I want to split this vector into senteces by using e.g. strssplit."
[5] "Can someone help me?"                                              
[6] "That would be nice?" 

The pattern matches a single whitespace character that is preceded by a punctuation character and followed by an uppercase letter. Only the whitespace is the delimiter: the lookbehind (?<=[[:punct:]]) and the lookahead (?=[A-Z]) are zero-width, so the punctuation stays at the end of the preceding piece and the uppercase letter stays at the start of the following piece. Because the letter after "lng." and "e.g." is lowercase, no split happens after these abbreviations.
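To see what the lookarounds contribute, here is a minimal sketch that compares the same pattern with and without them. The sample string x and the variable names consumed and kept are made up for illustration.

```r
x <- "One. Two? Three"

# Without lookarounds, the punctuation and the capital letter are part of
# the delimiter, so strsplit() removes them from the pieces.
consumed <- strsplit(x, "[[:punct:]]\\s[A-Z]", perl = TRUE)[[1]]

# With lookarounds, only the whitespace is the delimiter; the punctuation
# and the capital letter are checked but not consumed.
kept <- strsplit(x, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE)[[1]]

print(consumed)  # "One"  "wo"   "hree"
print(kept)      # "One." "Two?" "Three"
```

Lookbehind and lookahead are Perl-style regex features, which is why perl = TRUE is needed here.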

I just saw that you didn't split after a question mark in your desired output. If you only want to split after a ".", you can use this:

unlist(strsplit(string, "(?<=\\.)\\s(?=[A-Z])", perl = TRUE))

which gives:

[1] "This is a very long character vector."                             
[2] "Why is it so long? I think lng. is short for long."                
[3] "I want to split this vector into senteces by using e.g. strssplit."
[4] "Can someone help me? That would be nice?"  
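Note that [[:punct:]] also matches commas, quotes and other punctuation, so the first pattern would split "Fine, Thanks" after the comma. A possible middle ground is to list the sentence terminators explicitly. The sketch below is only an illustration: split_sentences and its terminators argument are hypothetical names, and the helper assumes the terminators contain no characters that are special inside a regex character class (such as "]", "^" or "-").

```r
# Hypothetical helper: split on whitespace that follows one of the given
# terminator characters and precedes an uppercase letter.
split_sentences <- function(x, terminators = ".?!") {
  pattern <- sprintf("(?<=[%s])\\s+(?=[A-Z])", terminators)
  # strsplit() is vectorised, so the result is a list with one
  # character vector of sentences per element of x.
  strsplit(x, pattern, perl = TRUE)
}

texts <- c("Hello there. How are you? Fine, Thanks.",
           "Use e.g. strsplit. It works.")

split_sentences(texts)                     # split after ".", "?" or "!"
split_sentences(texts, terminators = ".")  # split after "." only
```

Like the patterns above, this still splits after an abbreviation that happens to be followed by an uppercase letter (for example "Dr. Smith"), so it is a heuristic rather than a full sentence tokenizer.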
