(R) About stopwords in DocumentTermMatrix

Question

I have some questions about DocumentTermMatrix() and its stopwords. I ran the code below but couldn't get the results I wanted.

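library(tm)        # VCorpus(), VectorSource(), DocumentTermMatrix()
library(stringr)   # str_extract_all(), boundary()
library(magrittr)  # %>%

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))
mydtm <- DocumentTermMatrix(mycorpus, control = list(stopwords = FALSE))

# Word counts from stringr's word-boundary tokenizer
lapply(mycorpus, function(x) str_extract_all(x, boundary("word"))) %>% unlist() %>% table()
## .
## also  but  his   is   my text 
##    1    1    1    1    1    3 

# Term counts from the document-term matrix
apply(mydtm, 2, sum)
##  also   but   his  text text. 
##     1     1     1     2     1 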

First, even though I used stopwords = FALSE, the DTM still dropped some stopwords such as "is". However, it did not drop "his", although that word is listed in both stopwords("en") and stopwords("SMART"). So I don't understand which stopwords the DTM uses or why stopwords = FALSE doesn't work. What should I do to make it work?

Recommended answer

You could try an alternative package: quanteda. It lets you remove stopwords either after tokenizing or after creating the document-feature matrix. Below, I used pad = TRUE simply to show the slots where tokens matching stopwords were removed.

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

text <- "text is my text but also his text."

tokens(text) %>%
  tokens_remove(stopwords("en"), pad = TRUE)
## tokens from 1 document.
## text1 :
## [1] "text" ""     ""     "text" ""     "also" ""     "text" "."

Or:

dfm(text)
## Document-feature matrix of: 1 document, 7 features (0.0% sparse).
## 1 x 7 sparse Matrix of class "dfm"
##        features
## docs    text is my but also his .
##   text1    3  1  1   1    1   1 1

dfm(text, remove_punct = TRUE) %>%
  dfm_remove(stopwords("en"))
## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    text also
##   text1    3    1
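
The calls above were run with quanteda 1.4.1. As far as I know, newer quanteda releases expect dfm() to receive a tokens object rather than raw text, so the sketch below shows the same result built in explicit steps. The variable names (toks, toks_nostop, d) are only illustrative, and stopwords::stopwords() is called with its namespace so the example does not depend on what quanteda re-exports.

```r
library(quanteda)

text <- "text is my text but also his text."

# Step 1: tokenize and drop punctuation at the token level
toks <- tokens(text, remove_punct = TRUE)

# Step 2: remove English stopwords from the tokens
toks_nostop <- tokens_remove(toks, stopwords::stopwords("en"))

# Step 3: build the document-feature matrix from the remaining tokens
d <- dfm(toks_nostop)
d
```

Doing the removal on tokens rather than on the finished matrix keeps the option of using pad = TRUE, which matters if you later want n-grams that respect the original token positions.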

The list of English stopwords is just a character vector returned by the stopwords() function (which actually comes from the stopwords package). The default English list is the same as tm::stopwords("en"), except that it also contains "will" (the last entry in the output below), which the tm list does not. (If you want the SMART list, it's stopwords("en", source = "smart").)

stopwords("en")
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
## [101] "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"        
## [111] "the"        "and"        "but"        "if"         "or"        
## [116] "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"     
## [126] "against"    "between"    "into"       "through"    "during"    
## [131] "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"       
## [141] "on"         "off"        "over"       "under"      "again"     
## [146] "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"       
## [156] "any"        "both"       "each"       "few"        "more"      
## [161] "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"      
## [171] "so"         "than"       "too"        "very"       "will"
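
As for the tm call in the question, the missing "is" and "my" look like a word-length effect rather than a stopword effect. termFreq(), which DocumentTermMatrix() relies on, has a wordLengths control option that defaults to c(3, Inf), so terms shorter than three characters are dropped whatever stopwords is set to. That would also explain why "his" survives. The sketch below assumes this is what happened and lowers the minimum length to 1. removePunctuation = TRUE is my own addition, so that "text." is counted together with "text".

```r
library(tm)

text <- "text is my text but also his text."
mycorpus <- VCorpus(VectorSource(text))

mydtm <- DocumentTermMatrix(
  mycorpus,
  control = list(
    stopwords = FALSE,         # do not remove any stopwords
    wordLengths = c(1, Inf),   # keep one- and two-letter terms (default is c(3, Inf))
    removePunctuation = TRUE   # strip the trailing "." before tokenizing
  )
)

# Column sums give the total count of each term
apply(mydtm, 2, sum)
```

To actually remove stopwords in tm, pass stopwords = TRUE (or a character vector of your own) in the same control list. The word-length filter still applies independently of that setting.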
