Why isn't stemDocument stemming?


Question

I am using the 'tm' package in R to create a term document matrix using stemmed terms. The process is completing, but the resulting matrix includes terms that don't appear to have been stemmed, and I'm trying to understand why that is and how to fix it.

Here is the script for the process, which uses a couple of online news stories as the sandbox:

library(boilerpipeR)
library(RCurl)
library(tm)

# Pull the relevant parts of the news stories using 'boilerpipeR' and 'RCurl'
url <- "http://blogs.wsj.com/digits/2015/07/14/google-mozilla-disable-flash-over-security-concerns/"
extract <- LargestContentExtractor(getURL(url))
url2 <- "http://www.cnet.com/news/startup-lands-100-million-to-challenge-smartphone-superpowers-apple-and-google/"
extract2 <- LargestContentExtractor(getURL(url2))

# Now put those text vectors in a corpus and create a tdm
news.corpus <- VCorpus(VectorSource(c(extract, extract2)))
news.tdm <- TermDocumentMatrix(news.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stripWhitespace = TRUE,
                 stemDocument = TRUE))

# Now inspect the result
findFreqTerms(news.tdm, 4)

Here is the output that last line produces:

[1] "acadine"       "adobe"         "android"       "browser"       "challenge"     "companies"     "company"       "devices"       "firefox"       "flash"        
[11] "funding"       "gong"          "hackers"       "international" "ios"           "like"          "million"       "mobile"        "mozilla"       "mozillas"     
[21] "new"           "online"        "operating"     "said"          "security"      "smartphones"   "software"      "startup"       "system"        "systems"      
[31] "tsinghua"      "unigroup"      "used"          "users"         "videos"        "web"           "will"  

In line 1, for example, we see "companies" and "company", and we see "devices". I thought stemming would reduce "companies" and "company" to the same stem ("compani"?), and I thought it would trim the "s" off plurals like "devices". Am I wrong about that? If not, why isn't this code producing the desired result here?

Recommended answer

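Use stemming = TRUE or stemming = stemDocument instead of stemDocument = TRUE. (?termFreq shows that stemDocument is not a valid control parameter, so the entry appears to be ignored without an error, which would explain why the matrix builds but the terms are left unstemmed.)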
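As a rough illustration of the fix, here is a minimal sketch that swaps the scraped news stories for two made-up sentences, so it runs without boilerpipeR, RCurl, or network access. The text in docs is invented for this example, and it assumes the SnowballC package is installed, since tm relies on it for stemming. stripWhitespace is left out because it may not be a recognized control option either; check ?termFreq to confirm.

```r
library(tm)  # stemming additionally needs the SnowballC package installed

# Hypothetical stand-in text (not from the original stories)
docs <- c("The company sells devices. Other companies sell devices too.",
          "Hackers attacked the systems of one company and its system vendors.")

news.corpus <- VCorpus(VectorSource(docs))

# 'stemming = TRUE' is the control option named in the answer;
# 'stemDocument = TRUE' is not a recognized option and has no effect
news.tdm <- TermDocumentMatrix(news.corpus,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE,
                 stemming = TRUE))

# List the terms that ended up in the matrix
Terms(news.tdm)
```

With stemming active, "company" and "companies" should collapse onto a single stem, and plural forms such as "devices" should no longer appear as separate terms. Note that stems are not always dictionary words, so the resulting terms may look truncated.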
