Shorten (limit) the length of a sentence


Question

I have a column of long names and I would like to cut them to a maximum length of 40 characters.

Sample data:

x <- c("This is the longest sentence in world, so now just make it longer",
 "No in fact, this is the longest sentence in entire world, world, world, world, the whole world")

I would like to shorten each sentence to about 40 characters (plus or minus 3) without cutting it in the middle of a word, so the cut position is decided by the spaces between words.

I would also like to add three dots after each shortened sentence.

The desired output looks like this:

c("This is the longest sentence...","No in fact, this is the longest...")

This function only cuts blindly at 40 characters:

strtrim(x, 40)
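To make the problem concrete, here is a small base R check of where strtrim cuts the sample data. It is only an illustration; the variable names cut and next_char are mine, not from the question.

```r
x <- c("This is the longest sentence in world, so now just make it longer",
       "No in fact, this is the longest sentence in entire world, world, world, world, the whole world")

# Hard cut at 40 characters, regardless of word boundaries
cut <- strtrim(x, 40)

# The character right after the cut: anything other than a space
# means the cut landed in the middle of a word
next_char <- substr(x, 41, 41)

data.frame(cut = cut, next_char = next_char)
```

For the first sentence the cut falls inside the word "so"; for the second it happens to land on a word boundary, which is pure coincidence.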

Recommended answer

OK, I have a better solution now :)

x <- c("This is the longest sentence in world, so now just make it longer","No in fact, this is the longest sentence in entire world, world, world, world, the whole world")

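library(stringi)

extract <- function(x){
  # Greedily take up to 40 characters that are followed by a space or by the end of the string
  result <- stri_extract_first_regex(x, "^.{0,40}( |$)")
  # Only strings longer than 40 characters were actually shortened, so only they get the dots
  longer <- stri_length(x) > 40
  result[longer] <- stri_paste(result[longer], "...")
  result
}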
extract(x)
## [1] "This is the longest sentence in world, ..."   "No in fact, this is the longest sentence ..."
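Note that the output above keeps the space in front of the dots, and, as far as I can tell, a long string with no space among its first 41 characters would not match the pattern at all and would come back as NA. The following is a minimal base R sketch of the same regex idea that avoids both. It is not the answer's code: shorten_at_word and its width and suffix parameters are hypothetical names, and the hard cut for strings without a space is my own assumption about the desired fallback.

```r
# Hypothetical helper (base R only): cut at the last space that keeps the
# text within `width` characters, then append `suffix`.
shorten_at_word <- function(x, width = 40L, suffix = "...") {
  too_long <- !is.na(x) & nchar(x) > width
  if (!any(too_long)) return(x)

  long <- x[too_long]
  # Up to `width` characters that are immediately followed by a space;
  # the lookahead keeps the space itself out of the match
  pattern <- sprintf("^.{0,%d}(?= )", width)
  m <- regexpr(pattern, long, perl = TRUE)
  hit <- as.integer(m) != -1L

  # Assumed fallback: hard cut when there is no space to break on
  shortened <- substr(long, 1, width)
  shortened[hit] <- regmatches(long, m)

  x[too_long] <- paste0(shortened, suffix)
  x
}

x <- c("This is the longest sentence in world, so now just make it longer",
       "No in fact, this is the longest sentence in entire world, world, world, world, the whole world",
       "short")
shorten_at_word(x)
```

Like extract, this works on the whole vector at once, so it needs no sapply loop.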

Benchmark of the new version against the old one (32 000 sentences):

microbenchmark(sapply(x, cutAndAddDots, USE.NAMES = FALSE), extract(x), times=5)
Unit: milliseconds
                                        expr        min         lq     median         uq      max neval
 sapply(x, cutAndAddDots, USE.NAMES = FALSE) 3762.51134 3762.92163 3767.87134 3776.03706 3788.139     5
                                  extract(x)   56.01727   57.18771   58.50321   79.55759   97.924     5

Old version

This solution requires the stringi package and ALWAYS adds three dots ... to the end of the string.

require(stringi)
sapply(x, function(x) stri_paste(stri_wrap(x, 40)[1],"..."),USE.NAMES = FALSE)
## [1] "This is the longest sentence in world..." "No in fact, this is the longest..." 

This one adds the three dots only to sentences that are longer than 40 characters:

require(stringi)
cutAndAddDots <- function(x){
  w <- stri_wrap(x, 40)
  if(length(w) > 1){
    stri_paste(w[1],"...")
  }else{
    w[1]
  }
}
sapply(x, cutAndAddDots, USE.NAMES = FALSE)
## [1] "This is the longest sentence in world" "No in fact, this is the longest..."   
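If stringi is not available, the same wrap-and-take-the-first-line idea can be sketched with base R's strwrap. This is a swapped-in technique, not the answer's code: wrap_and_dot is a hypothetical name, and strwrap uses a simple greedy line-filling algorithm, so its break points may differ from those of stri_wrap.

```r
# Hypothetical helper (base R only): wrap each string and keep the first line.
wrap_and_dot <- function(x, width = 40L, suffix = "...") {
  vapply(x, function(s) {
    # strwrap() keeps lines strictly shorter than its `width` argument,
    # so width + 1 allows first lines of up to `width` characters
    w <- strwrap(s, width = width + 1L)
    # More than one line means the string did not fit and was shortened
    if (length(w) > 1L) paste0(w[1], suffix) else s
  }, character(1), USE.NAMES = FALSE)
}

x <- c("This is the longest sentence in world, so now just make it longer",
       "No in fact, this is the longest sentence in entire world, world, world, world, the whole world",
       "short")
wrap_and_dot(x)
```

Strings that fit on one line are returned unchanged, so short names never get the dots.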

PERFORMANCE NOTE: Setting normalize = FALSE in stri_wrap may make this roughly 3 times faster (tested on 30 000 sentences).

Test data:

x <- stri_rand_lipsum(3000)
x <- unlist(stri_split_regex(x,"(?<=\\.) "))
head(x)
[1] "Lorem ipsum dolor sit amet, vel commodo in."                                                    
[2] "Ultricies mauris sapien lectus dignissim."                                                      
[3] "Id pellentesque semper turpis habitasse egestas rutrum ligula vulputate laoreet mollis id."     
[4] "Curabitur volutpat efficitur parturient nibh sociosqu, faucibus tellus, eleifend pretium, quis."
[5] "Feugiat vel mollis ultricies ut auctor."                                                        
[6] "Massa neque auctor lacus ridiculus."                                                            
stri_length(head(x))
[1] 43 41 90 95 39 35

cutAndAddDots <- function(x){
  w <- stri_wrap(x, 40, normalize = FALSE)
  if(length(w) > 1){
    stri_paste(w[1],"...")
  }else{
    w[1]
  }
}
cutAndAddDotsNormalize <- function(x){
  w <- stri_wrap(x, 40, normalize = TRUE)
  if(length(w) > 1){
    stri_paste(w[1],"...")
  }else{
    w[1]
  }
}
require(microbenchmark)
microbenchmark(sapply(x, cutAndAddDots, USE.NAMES = FALSE),sapply(x, cutAndAddDotsNormalize, USE.NAMES = FALSE),times=3)
Unit: seconds
                                                 expr       min        lq    median        uq       max
          sapply(x, cutAndAddDots, USE.NAMES = FALSE)  3.917858  3.967411  4.016964  4.055571  4.094178
 sapply(x, cutAndAddDotsNormalize, USE.NAMES = FALSE) 13.493732 13.651451 13.809170 13.917854 14.026538
