R text mining: grouping similar words using stemDocuments in tm package
Question
I am doing text mining of around 30,000 tweets. To make the results more reliable, I want to convert "synonyms" to a single word: for example, some users write "girl", some "girls", some "gal". Similarly, "give" and "gave" mean the same thing, as do "come" and "came". Some users use short forms like "plz" and "pls". Also, stemDocument from the tm package is not working properly: it converts "dance" to "danc" and "table" to "tabl". Is there any other good package for stemming? I want to replace all these variants with one word each, so that I can count the correct frequencies in this data and my sentiment analysis is more reliable. The following is reproducible code (I cannot include the whole 30000x1 data frame here), edited after Ken's comments:
content<-c("n.n.t.t.t.t.t.t.girl.do.it.to.me.t.t.n.t.t.t.t.t.t.n.n.t.t.t.t.t.t.n.n.t.t.t.t.t.t.t.n.n.t.t.t.t.t.t.t.n.t.n.t.t.n.t.t.t.n.t.t.t.tajinkx.said..n.t.t.t.n.t.t.n.t.n.t.n.t.t.n.t.t.n.t.t.n.t.t.tok.guyz...srry.to.sound.dumb.toilets.i.dnt.drink.while.m.just.searching.for.fun..nso.is.going.to.bar.good.for.me.i.dnt.knw.what.washroom.all.happens.there.inside...so.would.like.if.someone.gals.helps.me.thankuu..n.t.t.n.t.t.t.tClick.to.expand....n.t.nBhai.tu.plz.rehne.de.....n.n.t.n.n.t.t.n.t.t.t.n.t.t.n.n.t.t.n.t.n.n.t.t.t.t.t.t.t.t..n.t.t.t.t.t.t.t.t.n.toilet.is.not .t.t.t.t.t.t.t.n.n.t.t.t.t.t.t.n.n.t.t.t.t.t.t.n.t.n.n.t.t.n.t.t.t.n.t.t.n.n.t.t.n.t.n.n.n.t.n.n.n.t.n.n.t.t.n.t.t.t.n.t.t.n.n.t.t.n.t.n.n.t.t.t.t.t.................................................................................................................................................. \n\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\n\t\n\t\t\n\t\t\t\n\t\t\t\tajinkx said:\n\t\t\t\n\t\t\n\t\n\t\n\t\t\n\t\t\n\t\t\n\t\t\tok guyz...srry to sound dumb!i dnt drink while m just searching for fun!\nso is going to bar good for me?i dnt knw what all happens there inside...so would like if someone helps me.thankuu!\n\t\t\n\t\t\t\tClick to expand...\n\t\nBhai,tu plz rehne de....\n\n\t\n\n\t\t\n\t\t\t\n\t\t\n\n\t\t\n\t\n\n\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\t\n\n\t\t\n\t\t\t\n\t\t\n\n\t\t\n\t\n\n\n\t\n\n\n\t\n\n\t\t\n\t\t\t\n\t\t\n\n\t\t\n\t\n\n\t\t\t\t\t\n\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t is this da bar which u guys r talking about???\nSent from my SM-N900 using Tapatalk\n\n\t\n\n\t\t\n\t\t\t\n\t\t\n\n\t\t\n\t\n\n\t\t\t\t\t\t\t\t \n\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\n\t\t\t\t\t\t\n\t\n\n\t")
# packages used below; not loaded in the original post
library(tm); library(spacyr); library(quanteda); library(syuzhet); library(ggplot2)
np <- 600; postop <- 1200; fr <- "yes" # input from GUI
#wbpage<-function (np,postop,fr){
#load("data_mpdb.Rdata")
#content<-as.data.frame(raw_dat[np:postop,],stringsAsFactors = FALSE)
#last<-rbind(tail(content,1),head(content,1));colnames(last)<-#c("stopdate","startdate")
message("Initializing part-1")
#---------------------data cleaning-----------------------------------------------------
#replied post
content2<-as.data.frame(content$txt,stringsAsFactors = FALSE);colnames(content2)<-c("txt")
content2 <- as.data.frame(gsub("(said:).*?(click to expand\\.{3})", " ", content$txt),stringsAsFactors = FALSE);
content2<-as.data.frame(lapply(content$txt, gsub, pattern = '(said:).*?(click to expand\\.{3})', replacement ="\\1 \\2", perl=TRUE),stringsAsFactors = FALSE);
content2<- as.data.frame(t(as.matrix(content2)));colnames(content2)<-c("txt");rownames(content2)<-NULL
#----------------ken's addition: lemmitization---------------------------
sp <- spacy_parse(as.character(content2$txt), lemma = TRUE)
sp$token <- ifelse(!grepl("^\\-[A-Z]+\\-$", sp$lemma), sp$lemma, sp$token)
# define equivalencies for please variants
dict <- dictionary(list(
please = c("please", "pls", "plz"),
girl = c("girl", "gal"),
toilet=c("toilet","shit","shitty","washroom")
))
toks <- as.tokens(sp) %>%
tokens(remove_punct = TRUE)
toks
new_stopwords<-c("said","one","click","expand","sent","using","attachment",
"tapatalk","will","can","hai","forum","like","just",
"get","know","also","now","bro","bhai","back","wat",
"ur","naa","nai","sala","email","urself","arnd","sim",
"pl","kayko","ho","gmail","sm","ll","g7102","iphone","yeah","time","asked","went","want","look","call","sit",
"even","first","place","left","visit","guy","around","started","came","dont","got","took","see","take","see","come")
toks <- tokens_remove(toks, c(stopwords("en"), new_stopwords))
#--------I have to make toks the same as content2 so that I can use it in further corpus building---------------------------
#clean the data: punctuation, digits, stopwords, whitespace, and lowercase
docs <- Corpus(VectorSource(content2$txt));#mname<-Corpus(VectorSource(content2$name))
message("Initializing part-1.2")
docs <- tm_map(docs, content_transformer(tolower));#mname<-tm_map(mname,content_transformer(tolower))
docs <- tm_map(docs, removePunctuation,preserve_intra_word_contractions=TRUE,preserve_intra_word_dashes=TRUE);#mname <- tm_map(mname, removePunctuation)
message("Initializing part-1.3")
docs <- tm_map(docs, removeWords, c(stopwords("english"),new_stopwords))
docs <- tm_map(docs, stripWhitespace);#mname <- tm_map(mname, stripWhitespace)
message("Initializing part-1.4")
docs <- tm_map(docs, removeWords,new_stopwords)
#------------------------Text stemming------------------------------------------
#docs <- tm_map(docs, stemDocument,language="english")
#-------------sentiment analysis--------------------------------------------------
message("Initializing part-2")
n <- 4
rnorm(10000, 0,1)
#incProgress(1/n, detail = paste("Finished section 1"))
docs_df <- data.frame(matrix(unlist(docs),nrow=length(docs), byrow=F),stringsAsFactors=FALSE)
docs_df<-docs_df[-c(2)];content2$editedtxt<-docs_df;
#----------------fr|fr:----------------------------------------------
if (fr=="yes"){
frlogic<-grepl("fr\\s|fr:", docs_df$X1);docs_df<-as.data.frame(docs_df[frlogic=="TRUE",],stringsAsFactors = FALSE);
docs_df[order(nchar(as.character(docs_df)),decreasing = FALSE),]
}
colnames(docs_df)<-c("txt")
d<-get_nrc_sentiment(as.character(docs_df))
td<-data.frame(t(d))
td_new <- data.frame(rowSums(td))
#Transformation and cleaning
names(td_new)[1] <-"count"
td_new <- cbind("sentiment"=rownames(td_new), td_new)
rownames(td_new) <- NULL
td_new2<-td_new[1:8,]
sentimentplot<-qplot(sentiment, data=td_new2, weight=count, geom="bar",fill=sentiment)+ggtitle("sentiments")
sentimentplot
Right now I am getting this error: "Finding a python executable with spaCy installed... Error in set_spacy_python_option(python_executable, virtualenv, condaenv, : No python was found on system PATH"
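That error means spacyr cannot find a Python installation with spaCy available. One way to resolve it from within R is to use spacyr's own setup helpers; a minimal sketch (spacy_install() creates a dedicated conda environment, so it assumes conda/miniconda is available or can be installed on the machine):

```r
library(spacyr)

# one-time setup: creates a conda environment with spaCy
# and downloads the English language model
spacy_install()

# in later sessions, initialize spacyr before calling spacy_parse()
spacy_initialize()

# alternatively, if you already have a Python with spaCy installed,
# point spacyr at it explicitly:
# spacy_initialize(python_executable = "/path/to/python")
```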
Also, I have to make toks the same as content2 so that I can use it in further corpus building for further analysis.
Awaiting your reply. Thanks.
Answer
That code is not reproducible, since we don't have the input content2. But here's an example that you can use.
What you call "converting synonyms" for variants, like "give" and "gave" or "girl" versus "girls", is not just a matter of stemming; it's a matter of lemmatization (for give/gave, for instance). To lemmatize, you need functionality not present in the tm package.
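For context on why stemDocument produces "danc" and "tabl": tm stems with the Porter/Snowball algorithm (via the SnowballC package), which truncates words to stems that are not required to be dictionary words, and leaves irregular forms like "gave" untouched. A quick illustration using SnowballC directly:

```r
library(SnowballC)

# Porter/Snowball stemming truncates rather than mapping
# words to dictionary forms, and does not handle irregulars
wordStem(c("dance", "dancing", "table", "girls", "gave"),
         language = "english")
# "danc" "danc" "tabl" "girl" "gave"
```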
I recommend you try spacyr for lemmatization, and quanteda for the rest. Here's how. We start with some text, and then parse it using spacy_parse().
txt <- c(
"The girl and the girls gave all they had to give.",
"Pls say plz, please, gal."
)
new_stopwords <- c(
"yeah", "time", "asked", "went", "want", "look", "call",
"sit", "even", "first", "place", "left", "visit", "guy",
"around", "started", "came", "dont", "got", "took", "see",
"take", "see", "come"
)
library("spacyr")
sp <- spacy_parse(txt, lemma = TRUE)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.2.3, language model: en_core_web_sm)
## (python options: type = "condaenv", value = "spacy_condaenv")
sp
## doc_id sentence_id token_id token lemma pos entity
## 1 text1 1 1 The the DET
## 2 text1 1 2 girl girl NOUN
## 3 text1 1 3 and and CCONJ
## 4 text1 1 4 the the DET
## 5 text1 1 5 girls girl NOUN
## 6 text1 1 6 gave give VERB
## 7 text1 1 7 all all DET
## 8 text1 1 8 they -PRON- PRON
## 9 text1 1 9 had have AUX
## 10 text1 1 10 to to PART
## 11 text1 1 11 give give VERB
## 12 text1 1 12 . . PUNCT
## 13 text2 1 1 Pls pls INTJ
## 14 text2 1 2 say say VERB
## 15 text2 1 3 plz plz INTJ
## 16 text2 1 4 , , PUNCT
## 17 text2 1 5 please please INTJ
## 18 text2 1 6 , , PUNCT
## 19 text2 1 7 gal gal PROPN
## 20 text2 1 8 . . PUNCT
We're going to convert this into quanteda tokens, but first let's replace the token with its lemma (unless it's a part of speech identifier, like "-PRON-").
# replace the token with its lemma (unless it's "-PRON-" for instance)
sp$token <- ifelse(!grepl("^\\-[A-Z]+\\-$", sp$lemma), sp$lemma, sp$token)
For your slang variations, we need to define equivalencies manually, which we can do using a quanteda "dictionary".
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.0.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
# define equivalencies for please variants
dict <- dictionary(list(
please = c("please", "pls", "plz"),
girl = c("girl", "gal")
))
We'll use that in a minute. First, let's create a tokens object from the spacyr parsed output, and remove punctuation.
toks <- as.tokens(sp) %>%
tokens(remove_punct = TRUE)
toks
## Tokens consisting of 2 documents.
## text1 :
## [1] "the" "girl" "and" "the" "girl" "give" "all" "they" "have" "to"
## [11] "give"
##
## text2 :
## [1] "pls" "say" "plz" "please" "gal"
Removing stopwords is easy, with the tokens_remove() function.
# now remove stopwords
toks <- tokens_remove(toks, c(stopwords("en"), new_stopwords))
toks
## Tokens consisting of 2 documents.
## text1 :
## [1] "girl" "girl" "give" "give"
##
## text2 :
## [1] "pls" "say" "plz" "please" "gal"
And now to make the variations of "girl" and "please" equivalent, we use tokens_lookup():
toks <- tokens_lookup(toks, dictionary = dict, exclusive = FALSE, capkeys = FALSE)
toks
## Tokens consisting of 2 documents.
## text1 :
## [1] "girl" "girl" "give" "give"
##
## text2 :
## [1] "please" "say" "please" "please" "girl"
For sentiment analysis, you could apply a sentiment dictionary using tokens_lookup() again, and create a dfm (document-feature matrix) from this. (Note: "say" is not really a negative word, but I am using it as such for an example here.)
sentdict <- dictionary(list(
positive = c("nice", "good", "please", "give"),
negative = c("bad", "say")
))
tokens_lookup(toks, dictionary = sentdict) %>%
dfm()
## Document-feature matrix of: 2 documents, 2 features (25.0% sparse).
## features
## docs positive negative
## text1 2 0
## text2 3 1
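As for getting toks back into the same form as content2 (a plain character vector you can pass to tm's VectorSource() for further corpus building): you can collapse each document's tokens into one string. A minimal sketch, where the document names d1/d2 and their contents are just illustrative stand-ins for the cleaned tokens object above:

```r
library(quanteda)

# stand-in for the cleaned tokens object from the steps above
toks <- tokens(c(d1 = "girl girl give give",
                 d2 = "please say please girl"))

# collapse each tokens document back into a single string;
# the result is a named character vector, one element per document
txt_clean <- vapply(as.list(toks), paste, character(1), collapse = " ")
txt_clean
# d1: "girl girl give give"   d2: "please say please girl"
```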