Making a wordcloud, but with combined words?


Problem description

I am trying to make a word cloud of publication keywords, for example: educational data mining; collaborative learning; computer science; etc.

My current code is as follows:

library(tm)

KeywordsCorpus <- Corpus(VectorSource(subset(Words$Author.Keywords, Words$Year == 2012)))
KeywordsCorpus <- tm_map(KeywordsCorpus, removePunctuation)
KeywordsCorpus <- tm_map(KeywordsCorpus, removeNumbers)

# lower-case the text; content_transformer() keeps the corpus class intact in tm >= 0.6
KeywordsCorpus <- tm_map(KeywordsCorpus, content_transformer(tolower))
KeywordsCorpus <- tm_map(KeywordsCorpus, removeWords, stopwords("english"))

# collapse runs of whitespace left over from the removals
KeywordsCorpus <- tm_map(KeywordsCorpus, stripWhitespace)

dtm4 <- TermDocumentMatrix(KeywordsCorpus)
m4 <- as.matrix(dtm4)
v4 <- sort(rowSums(m4), decreasing = TRUE)
d4 <- data.frame(word = names(v4), freq = v4)

However, with this code it splits out each word by itself, and what I need is combined words/phrases. For instance, "Educational Data Mining" is one phrase that I need to show, instead of what is happening now: "Educational", "Data", "Mining". Is there a way to keep each compound of words together? The semicolon might help as a separator.

Thanks.

Recommended answer

Here's a solution using a different text package, quanteda, which lets you form multi-word expressions either from statistically detected collocations or simply by forming all bigrams.

library(quanteda)
packageVersion("quanteda")
## [1] ‘0.9.5.14’

First, the method that detects the top 1,500 bigram collocations and replaces those collocations in the texts with their single-token versions (concatenated with the "_" character). Here I am using the package's built-in corpus of US presidential inaugural addresses.

### for just the top 1500 collocations
# detect the collocations
colls <- collocations(inaugCorpus, n = 1500, size = 2)

# remove collocations containing stopwords
colls <- removeFeatures(colls, stopwords("SMART"))
## Removed 1,224 (81.6%) of 1,500 collocations containing one of 570 stopwords.

# replace the phrases with single-token versions
inaugCorpusColl2 <- phrasetotoken(inaugCorpus, colls)

# create the document-feature matrix
inaugColl2dfm <- dfm(inaugCorpusColl2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... indexing features: 9,741 feature types
## ... removed 430 features, from 570 supplied (glob) feature types
## ... complete. 
## ... created a 57 x 9311 sparse dfm
## Elapsed time: 0.163 seconds.

# plot the wordcloud
set.seed(1000)
png("~/Desktop/wcloud1.png", width = 800, height = 800)
plot(inaugColl2dfm["2013-Obama", ], min.freq = 2, random.order = FALSE, 
     colors = sample(colors()[2:128]))
dev.off()

This results in the following plot. Note the collocations, such as "generation's_task" and "fellow_americans".

The version formed with all bigrams is easier, but it results in a huge number of low-frequency bigram features. For the word cloud, I selected a larger set of texts, not just the 2013 Obama address.

### version with all bi-grams
inaugbigramsDfm <- dfm(inaugCorpusColl2, ngrams = 2, ignoredFeatures = stopwords("SMART"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 57 documents
## ... indexing features: 64,108 feature types
## ... removed 54,200 features, from 570 supplied (glob) feature types
## ... complete. 
## ... created a 57 x 9908 sparse dfm
## Elapsed time: 3.254 seconds.

# plot the bigram wordcloud - more texts because for a single speech, 
# almost none occur more than once
png("~/Desktop/wcloud2.png", width = 800, height = 800)
plot(inaugbigramsDfm[40:57, ], min.freq = 2, random.order = FALSE, 
     colors = sample(colors()[2:128]))
dev.off()
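For the asker's original data, the semicolon separator may make collocation detection unnecessary: each semicolon-delimited keyword can simply be kept as one token. Below is a minimal sketch of that idea in base R; the `Words` data frame and its `Author.Keywords` column are taken from the question, but the sample rows here are made up for illustration.

```r
# A minimal sketch, assuming keywords arrive in a column like the question's
# Words$Author.Keywords, with whole phrases separated by semicolons.
Words <- data.frame(
  Author.Keywords = c("Educational Data Mining; Collaborative Learning",
                      "Computer Science; Educational Data Mining"),
  stringsAsFactors = FALSE
)

# split on the semicolon so each multi-word phrase stays a single token
kw <- unlist(strsplit(Words$Author.Keywords, ";", fixed = TRUE))
kw <- tolower(trimws(kw))   # normalise case, strip surrounding spaces
kw <- kw[nzchar(kw)]        # drop any empty strings

# count phrase frequencies, most frequent first
freq <- sort(table(kw), decreasing = TRUE)
d <- data.frame(word = names(freq), freq = as.integer(freq),
                stringsAsFactors = FALSE)
# "educational data mining" now has frequency 2, the others 1

# d can then be fed to a word cloud the same way d4 was, e.g.:
# wordcloud::wordcloud(d$word, d$freq, min.freq = 1, random.order = FALSE)
```

Because the phrases are never tokenized into individual words, "educational data mining" appears in the cloud as one entry rather than three.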

This produces:
