删除标点符号,但保留表情符号? [英] Remove punctuation but keeping emoticons?

查看:135
本文介绍了删除标点符号,但保留表情符号?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

是否可以删除所有标点符号,但保留诸如此类的表情符号

Is that possible to remove all the punctuations but keeping the emoticons such as

:-(

:)

:D

:p

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

推荐答案

与@gagolews的解决方案相比,这是一种较不复杂的方法,而且速度可能较慢.它要求您为它提供一个图释字典.您可以创建它,也可以使用qdapDictionaries包中的那个.基本方法将表情符号转换为不会被误认为其他文字的文本(我使用dat$Temp <-前缀来确保这一点).然后使用qdap::strip去除标点符号,然后通过mgsub将占位符转换回图释:

Here's an approach that is less sophisticated and likely slower than @gagolews's solution. It requires you feed it an emoticon dictionary. You can create that or use the one in the qdapDictionaries package. The basic approach converts the emoticons to text that couldn't be mistaken for anything else (I use dat$Temp <- prefix to ensure this). Then you strip out punctuation using qdap::strip and then convert the placeholders back into emoticons via mgsub:

library(qdap)
#reps <- emoticon
emos <- c(":-(", ":)", ":D", ":p", "X-(")
reps <- data.frame(seq_along(emos), emos)

reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))

查看:

truncdf(left_just(dat[, 3, drop=F]), 50)

##   Temp                                              
## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
## 2 You know there is a problem when customer service 
## 3 ãããæããããéãããæãããInappropriate announce:-(         
## 4 AirAsia your direct debit Maybank payment gateways
## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
## 6 AirAsia Apart from the slight delay and shortage o

编辑:要按要求保留?!,请在strip函数中传递char.keep参数:

EDIT: To keep the ? and ! as requested pass the char.keep argument in strip function:

dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))

这篇关于删除标点符号,但保留表情符号?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆