R tm substitute words in Corpus using gsub

Question

I have a large document corpus with more than 200 documents. As you can expect from such a large corpus, some of the words are misspelled, appear in different formats, and so on. I have done the standard text processing: converting to lower case, removing punctuation, and word stemming. I am trying to substitute some words to correct spelling and standardize them before moving on to analysis. I have done more than 100 substitutions using the same syntax as below, and most of them work as expected. However, some (about 5%) do not. For example, the following substitutions seem to have only a limited effect:

docs <- tm_map(docs, content_transformer(gsub), pattern = "medecin|medicil|medicin|medicinee", replacement = "medicine")
docs <- tm_map(docs, content_transformer(gsub), pattern = "eephant|eleph|elephabnt|elleph|elephanyt|elephantant|elephantant", replacement = "elephant")
docs <- tm_map(docs, content_transformer(gsub), pattern = "firehood|firewod|firewoo|firewoodloc|firewoog|firewoodd|firewoodd", replacement = "firewood") 

By limited effect I mean that some substitutions work and some do not. For example, despite trying to replace "elephantant", "medicinee", and "firewoodd", these terms still exist when I create the DTM (document-term matrix).

I have no idea why this mixed effect is happening.

Also, the following line replaces every word in the corpus with some combination of "collect":

docs <- tm_map(docs, content_transformer(gsub), pattern = "colect|colleci|collectin|collectiong|collectng|colllect|", replacement = "collect")

Just for reference, when I substitute a single word, I use the following syntax (notice the fixed=TRUE):

docs <- tm_map(docs, content_transformer(gsub), pattern = "charcola", replacement = "charcoal", fixed=TRUE)

A single-word substitution that fails is:

docs <- tm_map(docs, content_transformer(gsub), pattern = "dogmonkeycat", replacement = "dog monkey cat", fixed=TRUE)

Recommended answer

The issue is that the alternations in your patterns are not anchored, so the first alternative that matches "wins" (it is the one used) and the rest are not considered.
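A quick way to see this effect is to run the patterns on a few plain strings outside tm. The sketch below is illustrative only: the sample words are made up, and `perl = TRUE` is an assumption added here so that alternatives are tried strictly left to right (the default engine may resolve alternations differently).

```r
# Hypothetical sample words; not taken from the original corpus.
words <- c("medicine", "medicinee")

# perl = TRUE is an assumption: PCRE tries alternatives left to right,
# so "medicin" is matched before "medicinee" is ever tried.
result <- gsub("medecin|medicil|medicin|medicinee", "medicine",
               words, perl = TRUE)
print(result)
# "medicine"  becomes "medicinee"  (the prefix "medicin" is replaced, the final "e" stays)
# "medicinee" becomes "medicineee" (only "medicin" is replaced, the "ee" stays)

# The "collect" pattern from the question ends with "|", which adds an
# empty alternative. An empty alternative can match at every position,
# so words unrelated to "collect" are modified as well.
collect_result <- gsub("colect|colleci|collectin|collectiong|collectng|colllect|",
                       "collect", "dog", perl = TRUE)
print(collect_result)
```

Dropping the trailing `|` removes the empty alternative, so words that contain none of the listed misspellings are left alone.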

You should either use some "anchors" (say, word boundaries) around the alternations:

pattern = "\\b(medecin|medicil|medicin|medicinee)\\b"

or just put the longer alternatives before the shorter ones:

pattern = "medicinee|medecin|medicil|medicin"
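The two options can be compared side by side on plain strings. This is a minimal sketch, not part of the original answer: the `words` vector and the helper variable names are hypothetical, and `perl = TRUE` is again an assumption so that alternatives are tried left to right.

```r
# Hypothetical sample words; the first one is already spelled correctly.
words <- c("medicine", "medicinee", "medecin", "medicil")

# Option 1: word boundaries. A match must cover a whole word, so the
# correct "medicine" is left alone and "medicinee" is matched in full.
anchored <- gsub("\\b(medecin|medicil|medicin|medicinee)\\b", "medicine",
                 words, perl = TRUE)
print(anchored)
# "medicine" "medicine" "medicine" "medicine"

# Option 2: longest alternatives first, built programmatically.
alternatives <- c("medecin", "medicil", "medicin", "medicinee")
by_length <- alternatives[order(nchar(alternatives), decreasing = TRUE)]
ordered_pattern <- paste(by_length, collapse = "|")
ordered <- gsub(ordered_pattern, "medicine", words, perl = TRUE)
print(ordered)
# "medicinee" "medicine" "medicine" "medicine"
```

Ordering alone fixes "medicinee", but it adds no boundaries, so a word that merely contains one of the alternatives (here the correct "medicine") is still rewritten. With tm, either pattern would be passed the same way as in the question, for example `tm_map(docs, content_transformer(gsub), pattern = ..., replacement = "medicine", perl = TRUE)`.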

Note that you can make the pattern faster by using character classes for commonly mistyped vowels (see [ie]) and groups:

pattern = "med[ie]ci(?:n(?:ee)?|l)"
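As a rough check, the compact pattern can be tried on the four misspellings it is meant to cover. The sketch assumes `perl = TRUE` (so the `(?:...)` non-capturing groups are handled by PCRE); the `misspellings` vector and the boundary-wrapped variant are illustrative additions, not part of the original answer.

```r
# The four alternatives from the original pattern.
misspellings <- c("medecin", "medicil", "medicin", "medicinee")

# Compact pattern from the answer; perl = TRUE is an assumption.
compact <- "med[ie]ci(?:n(?:ee)?|l)"
fixed_words <- gsub(compact, "medicine", misspellings, perl = TRUE)
print(fixed_words)
# "medicine" "medicine" "medicine" "medicine"

# Illustrative variant: wrap the compact pattern in word boundaries so
# that an already correct "medicine" is not touched.
bounded <- paste0("\\b", compact, "\\b")
bounded_words <- gsub(bounded, "medicine", c("medicine", "medicinee"),
                      perl = TRUE)
print(bounded_words)
# "medicine" "medicine"
```

The optional group `(?:ee)?` is greedy, so "medicinee" is consumed in full before the replacement is applied.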
