How to implement proximity rules in tm dictionary for counting words?


Objective

I would like to count the number of times the word "love" appears in a document, but only if it isn't preceded by the word 'not', e.g. "I love films" would count as one appearance whilst "I do not love films" would not count as an appearance.
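To make the rule concrete before turning to tm, here is a minimal base-R sketch (purely illustrative, not a tm solution): a negative lookbehind matches "love" only when it is not immediately preceded by "not".

# word boundaries keep 'lovely' from matching; this only tests presence, not counts
grepl("(?<!not )\\blove\\b", tolower(c("I love films", "I do not love films")), perl = TRUE)
# [1]  TRUE FALSE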

Question

How would one proceed using the tm package?

R Code

Below is some self contained code which I would like to modify to do the above.

require(tm)

# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.", 
          "I do not love the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
          "I hate the `Red Hot Chilli Peppers`!")

# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)

# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))

# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)

# construct dictionary
my.dictionary.terms <- tolower(c("love", "Hate"))
my.dictionary <- Dictionary(my.dictionary.terms)

# construct the term document matrix
my.tdm <- TermDocumentMatrix(my.corpus, control = list(dictionary = my.dictionary))
inspect(my.tdm)

# Terms  positiveText neutralText negativeText
# hate            0           1            1
# love            2           1            0

Further information

I am trying to reproduce the dictionary rules functionality from the commercial package WordStat. It is able to make use of dictionary rules, i.e.

"hierarchical content analysis dictionaries or taxonomies composed of words, word patterns, phrases as well as proximity rules (such as NEAR, AFTER, BEFORE) for achieving precise measurement of concepts"

Also I noticed this interesting SO question: Open-source rule-based information extraction frameworks?


UPDATE 1: Based on @Ben's comment and post I got this (although slightly different at the end, it is strongly inspired by his answer, so full credit to him):

require(data.table)
require(RWeka)

# uni-gram and bi-gram tokeniser function
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

# get all 1-gram and 2-gram word counts
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))

# convert to data.table
dt <- as.data.table(as.data.frame(as.matrix(tdm)), keep.rownames=TRUE)
setkey(dt, rn)

# attempt at extracting but includes overlaps i.e. words counted twice 
dt[like(rn, "love")]
#            rn positiveText neutralText negativeText
# 1:     i love            1           0            0
# 2:       love            2           1            0
# 3: love peopl            1           0            0
# 4:   love the            1           1            0
# 5:  most love            1           0            0
# 6:   not love            0           1            0

Then I guess I would need to do some row sub-setting and row subtraction, which would lead to something like

dt1 <- dt["love"]
#     rn positiveText neutralText negativeText
#1: love            2           1            0

dt2 <- dt[like(rn, "love") & like(rn, "not")]
#         rn positiveText neutralText negativeText
#1: not love            0           1            0

# somehow do something like 
# DT = dt1 - dt2 
# but I can't work out how to code that, but the required output would be
#     rn positiveText neutralText negativeText
#1: love            2           0            0

I don't know how to get that last line using data.table, but this approach would be akin to WordStat's 'NOT NEAR' dictionary function, e.g. in this case only count the word "love" if it doesn't appear within one word directly before or directly after the word 'not'.
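For what it is worth, here is a minimal data.table sketch of that last step (assuming dt1 and dt2 are exactly as printed above and the document columns are named positiveText, neutralText and negativeText); it subtracts the per-document 'not love' counts from the plain 'love' counts:

doc_cols <- c("positiveText", "neutralText", "negativeText")
DT <- copy(dt1)                                  # start from the row for plain 'love'
for (j in doc_cols) {
  # subtract, per document, the counts of the negated bigrams captured in dt2
  set(DT, j = j, value = DT[[j]] - sum(dt2[[j]]))
}
DT
#      rn positiveText neutralText negativeText
# 1: love            2           0            0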

If we were to do an m-gram tokeniser then it would be like saying we only count the word "love" if it doesn't appear within (m-1) words either side of the word "not".
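Outside of tm, that generalized rule can also be expressed directly on a token stream. The sketch below is mine and purely illustrative (the function name and the crude tokenisation are assumptions, and it works on the raw, unstemmed text rather than on my.corpus): it counts "love" only when "not" does not occur within the (m-1)-word window on either side.

count_not_near <- function(txt, target = "love", blocker = "not", m = 2) {
  # crude tokenisation on anything that isn't a letter or apostrophe
  words <- unlist(strsplit(tolower(txt), "[^a-z']+"))
  words <- words[words != ""]
  hits  <- which(words == target)
  keep  <- vapply(hits, function(i) {
    window <- words[max(1, i - (m - 1)):min(length(words), i + (m - 1))]
    !(blocker %in% window)                       # keep the hit only if 'not' is absent
  }, logical(1))
  sum(keep)
}

sapply(my.docs, count_not_near, m = 2, USE.NAMES = FALSE)
# [1] 1 0 0   (unlike the stemmed TDM above, 'lovely' is not counted here)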

Other approaches are most welcome!

Solution

This is an interesting question about collocation extraction, which doesn't seem to be built into any packages (except this one, though it's not on CRAN or GitHub), despite how popular it is in corpus linguistics. I think this code will answer your question, but there might be a more general solution than this.

Here's your example (thanks for the easy-to-use example):

##############
require(tm)

# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.", 
             "I do not `love` the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
             "I hate the `Red Hot Chilli Peppers`!")

# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)

# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))

# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
# 'not' is a stopword so let's not remove that
# my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)

# construct dictionary - not used in this case
# my.dictionary.terms <- tolower(c("love", "Hate"))
# my.dictionary <- Dictionary(my.dictionary.terms)

Here's my suggestion: make a term-document matrix of bigrams and subset it.

# Tokenizer for n-grams, passed on to the term-document matrix constructor
library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
txtTdmBi <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
inspect(txtTdmBi)

# find bigrams that have 'love' in them
love_bigrams <- txtTdmBi$dimnames$Terms[grep("love", txtTdmBi$dimnames$Terms)]

# keep only bigrams where 'love' is not the first word
# to avoid counting 'love' twice and so we can subset 
# based on the preceding word
require(Hmisc)
love_bigrams <- love_bigrams[sapply(love_bigrams, function(i) first.word(i)) != 'love']
# exclude the specific bigram 'not love'
love_bigrams <- love_bigrams[!love_bigrams == 'not love']

And here's the result: we get a count of 2 for 'love', which excludes the 'not love' bigram.

# inspect the results
inspect(txtTdmBi[love_bigrams])

A term-document matrix (2 terms, 3 documents)

Non-/sparse entries: 2/4
Sparsity           : 67%
Maximal term length: 9 
Weighting          : term frequency (tf)

           Docs
Terms       positiveText neutralText negativeText
  i love               1           0            0
  most love            1           0            0

# get counts of 'love' (excluding 'not love')
colSums(as.matrix(txtTdmBi[love_bigrams]))
positiveText  neutralText negativeText 
           2            0            0 
