Multiple results of one variable when applying tm method "stemCompletion"

Problem description

I have a corpus of journal data with 15 observations of 3 variables (ID, title, abstract). Using RStudio, I read the data from a .csv file (one line per observation). While performing some text mining operations, I ran into trouble with the method stemCompletion: after applying it, the result for each stemmed line of the .csv is returned three times. All the other tm methods (e.g. stemDocument) produce only a single result. Why does this happen, and how can I fix it?

I used the following code:

data.corpus <- Corpus(DataframeSource(data))  
data.corpuscopy <- data.corpus
data.corpus <- tm_map(data.corpus, stemDocument)
data.corpus <- tm_map(data.corpus, stemCompletion, dictionary=data.corpuscopy) 

After applying stemDocument, there is a single result, e.g.:

"> data.corpus[[1]]

physic environ   sourc  innov investig  attribut  innov space
          investig  physic space intersect  innov  innov     relev attribut  physic space   innov        reflect  chang natur  innov  technolog advanc  servic  mean chang  argu   develop  innov space similar embodi  divers set  valu   collabor open  sustain use  literatur review interview  benchmark    examin  relationship  physic environ  innov         literatur review   interview underlin innov   communic  human centr process   result five attribut  innov space  present collabor enabl modifi smart attract   reflect       provid perspect   challeng    support innov creation  develop physic space   add   conceptu develop  innov space  outlin physic space   innov servic"

After using stemCompletion, the result appears three times:

"$`1`
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service
physical environment source innovation investigation attributes innovation space investigation physical space intersect innovation innovation relevant attributes physical space innovation reflect changes nature innovation technological advancements service meanwhile changes argues develop innovation space similarity embodies diversified set valuable collaboration open sustainability used literature review interviews benchmarking examine relationships physical environment innovation literature review interviews underline innovation communicative human centred processes result five attributes innovation space present collaboration enablers modifiability smartness attractiveness reflect provide perspectives challenge support innovation creation develop physical space addition conceptual develop innovation space outlines physical space innovation service"

Below is a reproducible example:

A .csv file containing three observations of three variables:

ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations

And below is the stemming method that I've used:

data = read.csv2("Test.csv")
data[,2]=as.character(data[,2])
data[,3]=as.character(data[,3])

corpus <- Corpus(DataframeSource(data)) 
corpuscopy <- corpus
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]

corpus <- tm_map(corpus, stemCompletion, dictionary=corpuscopy)
inspect(corpus[1:3])

It seems to depend on the number of variables in the .csv, but I have no idea why.

Recommended answer

There seems to be something odd about the stemCompletion function. It's not obvious how to use stemCompletion in tm version 0.6. There is a nice workaround here that I've used for this answer.

First, create the CSV file from your example:

dat <- read.csv2( text = 
                  "ID;Text A;Text B
1;Below is the first title;Innovation and Knowledge Management
2;And now the second Title;Organizational Performance and Learning are very important
3;The third title;Knowledge plays an important rule in organizations")

write.csv2(dat, "Test.csv", row.names = FALSE)

Read it in, transform it to a corpus, and stem the words:

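library(tm)         # Corpus, DataframeSource, tm_map, stemCompletion
library(SnowballC)  # stemmer used by stemDocument

# Written against tm 0.6; newer tm releases expect the data frame passed to
# DataframeSource to have `doc_id` and `text` columns
data <- read.csv2("Test.csv")
data[, 2] <- as.character(data[, 2])
data[, 3] <- as.character(data[, 3])

corpus <- Corpus(DataframeSource(data))
corpuscopy <- corpus  # unstemmed copy, used later as the completion dictionary
corpus <- tm_map(corpus, stemDocument)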

See if it worked:

inspect(corpus)

<<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
1
Below is the first titl
Innovat and Knowledg Manag

[[2]]
<<PlainTextDocument (metadata: 7)>>
2
And now the second Titl
Organiz Perform and Learn are veri import

[[3]]
<<PlainTextDocument (metadata: 7)>>
3
The third titl
Knowledg play an import rule in organ

Here's the nice workaround to get stemCompletion working:

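stemCompletion_mod <- function(x, dict = corpuscopy) {
  # Split the document into words, complete each stem, then rejoin into one document
  words <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(words, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, sep = "", collapse = " ")))
}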
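
To see what the helper does step by step, the same split, complete, and rejoin sequence can be tried on a plain character string. This is only a sketch: the `stems` and `dict` values are made up for illustration, and a character vector is passed as the dictionary instead of a corpus.

```r
library(tm)

# Hypothetical stemmed text and dictionary, for illustration only
stems <- "innovat and knowledg manag"
dict  <- c("innovation", "and", "knowledge", "management")

# 1. Split the text into single stems
words <- unlist(strsplit(stems, " "))

# 2. Complete each stem against the dictionary (shortest match wins)
completed <- stemCompletion(words, dictionary = dict, type = "shortest")

# 3. Glue the completed words back into one string
result <- paste(completed, collapse = " ")
print(result)
```

Calling stemCompletion on a vector of single words, rather than on a whole document, is the key point of the workaround: each stem is looked up on its own.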

Inspect the output to see if the stems were completed correctly:

lapply(corpus, stemCompletion_mod)

[[1]]
<<PlainTextDocument (metadata: 7)>>
1 Below is the first title Innovation and Knowledge Management

[[2]]
<<PlainTextDocument (metadata: 7)>>
2 And now the second Title Organizational Performance and Learning are NA important

[[3]]
<<PlainTextDocument (metadata: 7)>>
3 The third title Knowledge plays an important rule in organizations

Success!
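
One detail in the output is worth a closer look: the `NA` in the second document. As far as I understand it, stemCompletion looks for dictionary words that begin with the stem, so the stem "veri" finds nothing, because "very" does not start with it. The base-R sketch below mimics that prefix lookup. `complete_shortest` is a hypothetical helper written for this illustration, not part of tm, and the dictionary is a made-up subset of the example words.

```r
# Hypothetical helper: return the shortest dictionary word that starts with
# `stem`, or NA if no dictionary word has that prefix
complete_shortest <- function(stem, dictionary) {
  hits <- dictionary[startsWith(dictionary, stem)]
  if (length(hits) == 0) return(NA_character_)
  hits[which.min(nchar(hits))]
}

dictionary <- c("very", "important", "Learning", "Performance")
stems <- c("veri", "import", "Learn")

# Complete every stem; the result is a character vector named by stem
completed <- vapply(stems, complete_shortest, character(1), dictionary = dictionary)
print(completed)
```

If such gaps matter, one option is to fall back to the original stem whenever the lookup returns `NA`.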
