How to complete a stemmed corpus from a dictionary using the stemCompletion function (tm package)


Question

I am having trouble with the tm package in R (version 0.6.2). The following problem (two different errors) has already been answered here and here, but I still get an error after applying the posted solutions. Please click here to download the dataset (93 rows only). This is a reproducible example. The two errors are:

  1. (Resolved) Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"

  2. Error: inherits(doc, "TextDocument") is not TRUE

Please tell me what is wrong with my approach.

--

  # Data import
    df.imp<- read.csv("Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

   ##### Data Pre-Processing 

        install.packages("tm")
    require(tm)  

    ds.corpus<- Corpus(VectorSource(df.imp$Content))

    ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
    ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
    ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
    removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
    ds.corpus<- tm_map(ds.corpus,removeURL)

    stopwords.default<- stopwords("english")
    stopWordsNotDeleted<- c("isn't" ,     "aren't" ,    "wasn't" ,    "weren't"   , "hasn't"    ,
                            "haven't" ,   "hadn't"  ,   "doesn't" ,   "don't"      ,"didn't"    ,
                            "won't"   ,   "wouldn't",   "shan't"  ,   "shouldn't",  "can't"     ,
                            "cannot"    , "couldn't"  , "mustn't", "but","no", "nor", "not", "too", "very")

    stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
    ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

    copy<- ds.corpus ## creating a copy to be used as a dictionary

    ds.corpus<- tm_map(ds.corpus, stemDocument)

    ## error Statement #1
    ds.corpus<-  stemCompletion(ds.corpus, dictionary = copy) 
    ## Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"




    ds.cleanCorpus<- tm_map(ds.corpus, PlainTextDocument) ## creating plain text document

    class(ds.cleanCorpus) ## output is VCorpus" "Corpus".  what it should be??

    ## error Statement #2
    tdm<- TermDocumentMatrix(ds.corpus) ## creating  term document matrix 

    inherits(ds.cleanCorpus, "TextDocument") ## returns FALSE

Update: I figured out the first error. The x argument of stemCompletion should be a character vector, while the dictionary can be either a corpus or a character vector. However, when I tried it on the first document (a character vector) of ds.corpus, as below, the stemmed words were not completed and the output was just the same stemmed character vector as before.

stemCompletion(ds.corpus[[1]]$content, dictionary = copy) 

So my main question now is: how do I complete a stemmed corpus from a dictionary (tm package)? The stemCompletion method does not seem to work (on a character vector). Secondly, how can I complete the stems of an entire corpus? Should I use a for loop over the content of each document in the corpus?

Recommended answer

You need to change two things.

  1. When you use a custom function with tm_map, you need to wrap it in content_transformer:

removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)

ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))

  2. The purpose of the stemCompletion function is to try to complete a stemmed word (https://en.wikipedia.org/wiki/Stemming) based on a dictionary. The stemmed words need to be a character vector; the dictionary can be a corpus.

x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)

Output:

   compan     entit     suppl 
"company"        ""  "supply" 
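For a version that runs without the dataset, here is a minimal sketch that uses a small, made-up character vector as the dictionary (the words in `dictionary` are illustrative and are not taken from the phone reviews). It also tries the `type` argument of stemCompletion, which chooses among several matching completions. A stem with no match in the dictionary, here "entit", comes back without a completion.

```r
library(tm)

# Illustrative dictionary (assumption: not from the original dataset)
dictionary <- c("company", "companies", "supply", "supplier")
stems <- c("compan", "entit", "suppl")

# Pick the shortest / longest dictionary word that starts with each stem
shortest <- stemCompletion(stems, dictionary, type = "shortest")
longest  <- stemCompletion(stems, dictionary, type = "longest")

# Both results are character vectors named by the input stems
print(shortest)
print(longest)
```

If the result is unchanged from the input, it is worth checking whether the dictionary actually contains words that start with each stem, since completion is a prefix match.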

Code for creating the term-document matrix

# Data import
df.imp<- read.csv("data/Phone2_Sample100_NegPos.csv", header = TRUE, as.is = TRUE)

##### Data Pre-Processing 

#install.packages("tm")
require(tm)  

ds.corpus<- Corpus(VectorSource(df.imp$Content))

ds.corpus<- tm_map(ds.corpus, content_transformer(tolower))
ds.corpus<- tm_map(ds.corpus, content_transformer(removePunctuation))
ds.corpus<- tm_map(ds.corpus, content_transformer(removeNumbers))
removeURL<- function(x) gsub("http[[:alnum:]]*", "", x)
ds.corpus<- tm_map(ds.corpus,content_transformer(removeURL))


stopwords.default<- stopwords("english")
stopWordsNotDeleted<- c("isn't" ,     "aren't" ,    "wasn't" ,    "weren't"   , "hasn't"    ,
                        "haven't" ,   "hadn't"  ,   "doesn't" ,   "don't"      ,"didn't"    ,
                        "won't"   ,   "wouldn't",   "shan't"  ,   "shouldn't",  "can't"     ,
                        "cannot"    , "couldn't"  , "mustn't", "but","no", "nor", "not", "too", "very")

stopWord.new<- stopwords.default[! stopwords.default %in% stopWordsNotDeleted] ## new Stopwords list
ds.corpus<- tm_map(ds.corpus, removeWords, stopWord.new )

tdm<- TermDocumentMatrix(ds.corpus)
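To see why the content_transformer wrapper matters for TermDocumentMatrix, the following sketch compares the two calls on two toy documents (the strings in `docs` are made-up stand-ins for df.imp$Content, and VCorpus is used explicitly so the behaviour does not depend on which corpus class Corpus() returns in a given tm version). With a bare function, tm_map stores whatever the function returns in place of each document, which is expected to leave plain character vectors instead of text documents. The wrapped version changes only the content.

```r
library(tm)

# Toy documents (assumption: punctuation already removed, as in the pipeline above)
docs <- c("see httpexamplecom for details", "no link here")
corpus <- VCorpus(VectorSource(docs))

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

# Bare function: each document is replaced by the return value of gsub()
bare <- tm_map(corpus, removeURL)
# Wrapped: only the content is rewritten, the document object is kept
wrapped <- tm_map(corpus, content_transformer(removeURL))

# Check whether the elements are still text documents
print(inherits(bare[[1]], "TextDocument"))
print(inherits(wrapped[[1]], "TextDocument"))

# Building the matrix from the wrapped corpus
tdm <- TermDocumentMatrix(wrapped)
print(dim(tdm))
```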

Example of completing stemmed words

copy<- ds.corpus ## creating a copy to be used as a dictionary
x <- c("compan", "entit", "suppl")
stemCompletion(x, copy)
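The question also asks how to complete every document in a corpus. One possible approach, sketched below under stated assumptions, is to split each document into words, call stemCompletion on that character vector, and paste the result back, all inside content_transformer so no explicit for loop is needed. The helper name `completeDoc` and the toy documents are hypothetical, stemDocument assumes the SnowballC package is installed, and stems without a completion are kept as they are. This is not part of the original answer.

```r
library(tm)

# Toy documents (assumption: stand-ins for the phone reviews)
docs <- c("the company supplies phones", "companies supply many phones")
corpus <- VCorpus(VectorSource(docs))

dict <- corpus                           # unstemmed copy used as the dictionary
stemmed <- tm_map(corpus, stemDocument)  # requires SnowballC

# Hypothetical helper: complete the stems of one document's content
completeDoc <- function(x, dictionary) {
  words <- unlist(strsplit(x, "\\s+"))
  words <- words[nzchar(words)]
  if (length(words) == 0) return("")
  completed <- stemCompletion(words, dictionary = dictionary)
  # Keep the original stem where no completion was found
  missing <- is.na(completed) | completed == ""
  completed[missing] <- words[missing]
  paste(completed, collapse = " ")
}

completed <- tm_map(stemmed, content_transformer(completeDoc), dictionary = dict)

print(content(stemmed[[1]]))
print(content(completed[[1]]))
```

Because completion matches dictionary words that begin with the stem, a stem such as "mani" (from "many") may stay uncompleted. Passing the corpus as the dictionary on every call is simple but repeats work, so for a large corpus a precomputed character vector of dictionary words may be preferable.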
