Remove all punctuation from text including apostrophes for tm package

Question

I have a vector of Tweets (just the message text) that I am cleaning for text mining purposes. I have used removePunctuation from the tm package like so:

clean_tweet_text = removePunctuation(tweet_text)

This resulted in a vector with all punctuation removed from the text except apostrophes, which ruins my keyword searches because words touching an apostrophe are not matched. For example, one of my keywords is climate, but if a tweet contains 'climate it won't be counted.

How can I remove all the apostrophes/single quotes from my vector?

Here is the header from dput for a reproducible example:

c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap", 
"who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…", 
"rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…", 
"better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", 
"i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", 
"why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl", 
"ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck", 
"unusual warming kills gulf of maine cod  discovery news globalwarming  httpstco39uvock3xe", 
"this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc", 
"what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
)

Recommended answer

To remove all punctuation (including apostrophes and single quotes), you can just use gsub():

x <- c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap",
       "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…",
       "rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…",
       "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
       "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
       "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl",
       "ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck",
       "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe",
       "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc",
       "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o")

gsub("[[:punct:]]", "", x)
#>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
#>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
#>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
#>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
#>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
#>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
#>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
#>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
#>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"

gsub() replaces all occurrences of its first argument in its third argument with its second argument (see help("gsub")). Here, that means it replaces every occurrence in our vector x of any character in the set [[:punct:]] with "", which removes them.
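
If only apostrophes and single quotes need to go, the same gsub() call can target them directly. This is a minimal sketch, not part of the original answer: it assumes UTF-8 input and that the typographic quotes in the tweets are U+2018 and U+2019. The names tweet and no_quotes are illustrative.

```r
# Illustrative input: one tweet from the example, with the typographic
# quotes written as Unicode escapes (U+2018 and U+2019).
tweet <- "ted cruz \u2018climate change is not science it\u2019s religion\u2019"

# Character class listing the ASCII apostrophe plus both curly single quotes.
no_quotes <- gsub("['\u2018\u2019]", "", tweet)

print(no_quotes)
#> [1] "ted cruz climate change is not science its religion"
```

Listing the characters explicitly leaves all other punctuation in place, which may be useful if a later cleaning step still needs it.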

What characters does that remove? From help("regex"):

[:punct:]

    Punctuation characters:
    ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
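
As a quick check of that listing, the sketch below builds a string from the 32 ASCII characters shown above and confirms that gsub() strips all of them. The variable names are illustrative. Note that the listing covers ASCII only, and whether non-ASCII punctuation also matches [[:punct:]] may depend on the locale and platform.

```r
# The 32 ASCII punctuation characters from the help("regex") listing.
# The double quote and the backslash are escaped for the R string literal.
ascii_punct <- "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

# Append a few letters that should survive the cleanup.
sample_text <- paste0(ascii_punct, "abc")

stripped <- gsub("[[:punct:]]", "", sample_text)

print(nchar(ascii_punct))
#> [1] 32
print(stripped)
#> [1] "abc"
```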

Update

It appears this occurs because your apostrophes look like ’ (a typographic quote) rather than ' (the ASCII apostrophe). So, if you want to stick with tm::removePunctuation(), you can also use:

tm::removePunctuation(x, ucp = TRUE)
#>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
#>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
#>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
#>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
#>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
#>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
#>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
#>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
#>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
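
To see the difference between the two apostrophes, the base R sketch below compares their code points and then strips punctuation using the Unicode punctuation property \p{P} with perl = TRUE. It assumes UTF-8 input and a PCRE build with Unicode property support; that ucp = TRUE relies on a comparable pattern is my reading of the argument name, not something stated in the answer. The variable names are illustrative.

```r
# The typographic apostrophe and the ASCII apostrophe are different characters.
curly <- "\u2019"
straight <- "'"

print(utf8ToInt(curly))     # code point U+2019
#> [1] 8217
print(utf8ToInt(straight))  # code point U+0027
#> [1] 39

# \p{P} matches characters in the Unicode "punctuation" categories.
# Symbols such as $ belong to the "symbol" categories, so they are kept.
tweet <- "it\u2019s $5"
unicode_stripped <- gsub("\\p{P}", "", tweet, perl = TRUE)
print(unicode_stripped)
```

One consequence of this choice is that characters such as $, + or ~, which the [[:punct:]] listing includes, are classified as symbols in Unicode and are not removed by \p{P}.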
