Remove a verb as a stopword

Problem description

Some words are used sometimes as a verb and sometimes as another part of speech.

Example

A sentence where the word is used as a verb:

I blame myself for what happened

And a sentence where the word is used as a noun:

For what happened the blame is yours

I know in advance which word I want to detect; in the example above it is "blame". I would like to detect it and remove it as a stopword only when it is used as a verb.

Is there an easy way to do this?

Recommended answer

You can install TreeTagger and then call it from R through the koRpus package. Install TreeTagger in a location such as C:\Treetagger.

I will first show how TreeTagger works, so that you understand what is going on in the actual solution further down in this answer:

library(koRpus)

your_sentences <- c("I blame myself for what happened", 
                    "For what happened the blame is yours")

text.tagged <- treetag(file="I blame myself for what happened", 
                  format="obj", treetagger="manual", lang="en",
                  TT.options = list(path="C:\\Treetagger", preset="en") )
text.tagged@TT.res[, 1:2]
#       token tag    
#1         I  PP
#2     blame VVP 
#3    myself  PP 
#4       for  IN
#5      what  WP
#6  happened VVD 

The sentence has now been tagged, and the "only thing left" is to remove those occurrences of "blame" that are verbs.

I'll do this sentence by sentence with a function that first tags the sentence, then checks for "bad words" like "blame" that are also tagged as a verb, and finally removes them from the sentence:

remove_words <- function(sentence, badword="blame"){
  # Tag the sentence with TreeTagger (same call as in the example above)
  tagged.text <- treetag(file=sentence, format="obj", treetagger="manual", lang="en", 
                         TT.options=list(path="C:\\Treetagger", preset="en"))
  # Check for bad words AND verb (TreeTagger's English verb tags start with "V"):
  cond1 <- (tagged.text@TT.res$token == badword)
  cond2 <- (substring(tagged.text@TT.res$tag, 1, 1) == "V")
  redflag <- which(cond1 & cond2)

  # If no such case, return sentence as is. If so, then remove that word.
  # Note: the indices only line up when splitting on spaces yields the same
  # tokens as the tagger, i.e. when no punctuation is attached to the words.
  if(length(redflag) == 0) return(sentence)
  else{
    splitsent <- strsplit(sentence, " ")[[1]]
    splitsent <- splitsent[-redflag]
    return(paste0(splitsent, collapse=" "))
  }
}

lapply(your_sentences, remove_words)
# [[1]]
# [1] "I myself for what happened"
# [[2]]
# [1] "For what happened the blame is yours"
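
The filtering step can also be tried without a TreeTagger installation. The following is only a minimal sketch: `drop_verb_tokens` is a hypothetical helper (not part of koRpus), and the data frames are hand-built stand-ins for the tagger output. The tags of the first sentence are copied from the tagging example above; the tags of the second sentence are assumed for illustration. Unlike `remove_words`, it rebuilds the sentence from the tagger's own tokens, so it does not rely on the space-split words lining up with the tagged tokens.

```r
# Hypothetical helper: expects a data frame with `token` and `tag` columns,
# shaped like the tagger output shown earlier.
drop_verb_tokens <- function(tagged, badword = "blame") {
  is_badword <- tolower(tagged$token) == tolower(badword)
  is_verb    <- substr(tagged$tag, 1, 1) == "V"
  # Keep every token that is not (bad word AND verb), then re-join
  paste(tagged$token[!(is_badword & is_verb)], collapse = " ")
}

# Stand-in for the tagged first sentence (tags taken from the example above)
tagged_verb <- data.frame(
  token = c("I", "blame", "myself", "for", "what", "happened"),
  tag   = c("PP", "VVP", "PP", "IN", "WP", "VVD"),
  stringsAsFactors = FALSE
)

# Stand-in for the second sentence (tags are assumed, not real tagger output)
tagged_noun <- data.frame(
  token = c("For", "what", "happened", "the", "blame", "is", "yours"),
  tag   = c("IN", "WP", "VVD", "DT", "NN", "VBZ", "PP"),
  stringsAsFactors = FALSE
)

result_verb <- drop_verb_tokens(tagged_verb)
result_noun <- drop_verb_tokens(tagged_noun)
print(result_verb)
print(result_noun)
```

Joining tokens with a single space is a simplification; punctuation tokens would end up separated from the preceding word.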
