Replace words in R data.frames (Text Mining)
I'm working on a text mining solution with SQL and R.
First I import data into R from my SQL selection, and then I do data mining on it.
Here is what I have:
rawData = sqlQuery(dwhConnect,sqlString)
a = data.frame(rawData$ENNOTE_NEU)
If I run
a[[1]][1:3]
I see this structure:
[1] lorem ipsum li ld ee wö wo di dd
[2] la kdin di da dogs chicken
[3] kd good i need some help
Now I want to do some data cleaning with my own dictionary. An example would be to replace li with lorem ipsum, and both kd and kdin with kunde.
My problem is how to do that for the whole data frame.
for (i in 1:nrow(a))
{
  a[[1]][i] = gsub(" kd | kdin ", " kunde ", a[[1]][i])
  a[[1]][i] = gsub(" li ", " lorem ipsum ", a[[1]][i])
  ...
}
This works, but it is slow for large amounts of data.
Is there a better way to do this?
Cheers, The Captain
gsub is vectorised, so you don't need the loop.
a[[1]] <- gsub(" kd | kdin ", " kunde ", a[[1]])
is quicker.
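To make the vectorised call concrete, here is a minimal sketch on a small stand-in data frame. The sample strings are an abridged copy of the three rows shown in the question, and the column name note is made up for illustration; the real column comes from sqlQuery().

```r
# Stand-in for the real data; the column name "note" is hypothetical.
notes <- c("lorem ipsum li ld ee wo di dd",
           "la kdin di da dogs chicken",
           "kd good i need some help")
a <- data.frame(note = notes, stringsAsFactors = FALSE)

# One gsub() call per dictionary entry, applied to the whole column at once
# instead of once per row.
a[[1]] <- gsub(" kd | kdin ", " kunde ", a[[1]])
a[[1]] <- gsub(" li ", " lorem ipsum ", a[[1]])

print(a[[1]])
```

With these patterns the third row stays unchanged, because its kd sits at the very start of the string and has no leading space to match.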
Also, are you sure you want spaces inside your regexes? That way you won't match words at the start or end of lines.
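One possible way around the start-of-line and end-of-line problem is to use word boundaries (\\b) instead of literal spaces. The sketch below is only an illustration: the dictionary vector and its layout (names as patterns, values as replacements) are assumptions, not something from the question, and perl = TRUE is chosen here simply to get Perl-style regex handling.

```r
# Hypothetical dictionary: names are regex alternatives, values are replacements.
dictionary <- c("kd|kdin" = "kunde",
                "li"      = "lorem ipsum")

# Abridged sample rows from the question.
notes <- c("lorem ipsum li ld ee wo di dd",
           "la kdin di da dogs chicken",
           "kd good i need some help")

# Loop over the (short) dictionary rather than the (long) column of notes;
# each gsub() call still processes every note in one vectorised pass.
for (pattern in names(dictionary)) {
  # \\b matches at a word boundary, so words at the start or end of a
  # string are replaced as well.
  regex <- paste0("\\b(", pattern, ")\\b")
  notes <- gsub(regex, dictionary[[pattern]], notes, perl = TRUE)
}

print(notes)
```

Here the kd at the start of the third row is replaced too. The replacements run in dictionary order, so an entry whose replacement text contains a later pattern would be rewritten again; that may or may not be what you want.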