stemDocument in tm package not working on past tense words


Problem description

I have a file 'check_text.txt' that contains "said say says make made". I'd like to stem it to get "say say say make make". I tried stemDocument in the tm package, as shown below, but only got "said say say make made". Is there a way to stem past tense words? Is it necessary to do so in real-world natural language processing? Thanks!

library(tm)

filename <- "check_text.txt"
con <- file(filename, "rb")
text_data <- readLines(con, skipNul = TRUE)
close(con)
text_VS <- VectorSource(text_data)
text_corpus <- VCorpus(text_VS)
text_corpus <- tm_map(text_corpus, stemDocument, language = "english")
as.data.frame(text_corpus)$text

EDIT: I also tried wordStem() in the SnowballC package:

> library(SnowballC)
> wordStem(c("said", "say", "says", "make", "made"))
[1] "said" "sai"  "sai"  "make" "made"

Recommended answer

If a package shipped a data set of irregular English verbs, this task would be easy. I do not know of any package with such data, so I built my own database by scraping a website. I am not sure whether this website covers all irregular verbs; if necessary, look for a better source to build your own database. Once you have the database, you can work on the task.
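For illustration, such a database only needs three columns: the infinitive, the simple past, and the past participle. The following is a minimal hand-written sketch, assuming the column names Infinitive, Past and PP that the scraping code produces; the name mini_dic and its four rows are made up for this example and are not scraped data.

```r
# A hand-made stand-in for the scraped table (hypothetical sample rows).
mini_dic <- data.frame(
  Infinitive = c("say", "make", "draw", "do"),
  Past       = c("said", "made", "drew", "did"),
  PP         = c("said", "made", "drawn", "done"),
  stringsAsFactors = FALSE
)

# Look up the infinitive that belongs to a past form.
lookup <- mini_dic$Infinitive[match("made", mini_dic$Past)]
lookup
#[1] "make"
```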

First, I used stemDocument() to clean up present forms ending in -s. Then I collected the past forms that occur in words (past) and their infinitive forms (inf1), and identified the order of the past forms in temp (ind) as well as their positions in temp (position). Finally, I replaced the past forms with their infinitive forms. I repeated the same procedure for past participles.

library(tm)
library(rvest)
library(dplyr)
library(splitstackshape)


### Create a database
### (assumes the page still exposes its verb tables in the same layout)
x <- read_html("http://www.englishpage.com/irregularverbs/irregularverbs.html")

x %>%
  html_table(header = TRUE) %>%
  bind_rows() %>%
  rename(Past = `Simple Past`, PP = `Past Participle`) %>%
  filter(!Infinitive %in% LETTERS) %>%
  cSplit(splitCols = c("Past", "PP"),
         sep = " / ", direction = "long") %>%
  filter(complete.cases(.)) %>%
  ### Strip trailing notes such as " (...)" and " [?]"
  ### (across() replaces the deprecated mutate_each()/funs())
  mutate(across(everything(),
                ~ gsub(pattern = "\\s\\(.*\\)$|\\s\\[\\?\\]",
                       replacement = "",
                       x = .x))) -> mydic

### Work on the task

words <- c("said", "drawn", "say", "says", "make", "made", "done")

### says to say
temp <- stemDocument(words)

### past forms become present form
### Collect past forms
past <- mydic$Past[which(mydic$Past %in% temp)]

### Collect infinitive forms of past forms
inf1 <- mydic$Infinitive[which(mydic$Past %in% temp)]

### Identify the order of past forms in temp
ind <- match(temp, past)
ind <- ind[is.na(ind) == FALSE]

### Where are the past forms in temp?
position <- which(temp %in% past)

temp[position] <- inf1[ind]

### Check
temp
#[1] "say"   "drawn" "say"   "say"   "make"  "make"  "done" 


### PP forms to infinitive forms (same as past forms)

pp <- mydic$PP[which(mydic$PP %in% temp)]
inf2 <- mydic$Infinitive[which(mydic$PP %in% temp)]
ind <- match(temp, pp)
ind <- ind[is.na(ind) == FALSE]
position <- which(temp %in% pp)
temp[position] <- inf2[ind]

### Check
temp
#[1] "say"  "draw" "say"  "say"  "make" "make" "do" 
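
The two replacement passes above follow the same pattern, so they could be wrapped in one small function. The following is only a sketch: to_infinitive and the tiny hand-written dic are hypothetical names and data introduced here, and the stemDocument() step is left out so that the example runs without tm or the scraped table.

```r
# Hypothetical helper: replace every word found in `forms` with the matching
# entry of `infinitives`; words that are not found are returned unchanged.
to_infinitive <- function(words, forms, infinitives) {
  idx <- match(words, forms)   # NA where the word is not a listed form
  hit <- !is.na(idx)
  words[hit] <- infinitives[idx[hit]]
  words
}

# Tiny hand-written dictionary standing in for mydic (an assumption).
dic <- data.frame(
  Infinitive = c("say", "make", "draw", "do"),
  Past       = c("said", "made", "drew", "did"),
  PP         = c("said", "made", "drawn", "done"),
  stringsAsFactors = FALSE
)

words <- c("said", "drawn", "say", "make", "made", "done")
out <- to_infinitive(words, dic$Past, dic$Infinitive)  # past forms first
out <- to_infinitive(out, dic$PP, dic$Infinitive)      # then past participles
out
#[1] "say"  "draw" "say"  "make" "make" "do"
```

With the real data, dic would be replaced by mydic and words by the output of stemDocument().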
