R function to correct misspelled words using the most frequent similar word
Problem description
I have a table with misspelled words. I need to correct each of them by replacing it with the word most similar to it, choosing the one with the highest frequency.
For example, when I run
aggregate(CustomerID ~ Province, ventas2, length)
I get
1
2 AMBA 29
3 BAIRES 1
4 BENOS AIRES 1
12 BUENAS AIRES 1
17 BUENOS AIRES 4
18 buenos aires 7
19 Buenos Aires 3
20 BUENOS AIRES 11337
35 CORDOBA 2297
36 cordoba 1
38 CORDOBESA 1
39 CORRIENTES 424
So I need to replace buenos aires, Buenos Aires, BAIRES and the less frequent BUENOS AIRES entry with BUENOS AIRES, but AMBA should not be replaced. Likewise, CORDOBESA and cordoba should be replaced by CORDOBA, but CORRIENTES should not.
How can I do this in R?
Thanks!
Recommended answer
Here's a possible solution.
Disclaimer:
This code seems to work fine with your current example. I can't guarantee that the current parameters (e.g. cut height, cluster agglomeration method, distance method etc.) will be valid for your real (complete) data.
# recreating your data
data <-
read.csv(text=
'City,Occurr
AMBA,29
BAIRES,1
BENOS AIRES,1
BUENAS AIRES,1
BUENOS AIRES,4
buenos aires,7
Buenos Aires,3
BUENOS AIRES,11337
CORDOBA,2297
cordoba,1
CORDOBESA,1
CORRIENTES,424',stringsAsFactors=FALSE)
# simple pre-processing to city strings:
# - removing spaces
# - turning strings to uppercase
cities <- gsub('\\s+','',toupper(data$City))
# string distance computation
# N.B. here you can play with single components of distance costs
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign original cities names to distance matrix
rownames(d) <- data$City
# clustering cities
hc <- hclust(as.dist(d),method='single')
# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters)
# N.B. I decided to cut at distance height < 5
# (read it as: "I consider equal 2 strings needing
# less than 5 modifications to pass from one to the other")
# Obviously you can use another value.
rect.hclust(hc,h=4.9)
# get the clusters ids
clusters <- cutree(hc,h=4.9)
# turn into data.frame
clusters <- data.frame(City=names(clusters),ClusterId=clusters)
# merge with frequencies
merged <- merge(data,clusters,all.x=TRUE,by='City')
# add CityCorrected column to the merged data.frame
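ret <- by(merged,
          merged$ClusterId,
          FUN=function(grp){
            # pick the most frequent spelling within this cluster
            idx <- which.max(grp$Occurr)
            grp$CityCorrected <- grp[idx,'City']
            return(grp)
          })
# stack the per-cluster data.frames back into one
fixed <- do.call(rbind,ret)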
Result:
> fixed
             City Occurr ClusterId CityCorrected
1            AMBA     29         1          AMBA
2.2        BAIRES      1         2  BUENOS AIRES
2.3   BENOS AIRES      1         2  BUENOS AIRES
2.4  BUENAS AIRES      1         2  BUENOS AIRES
2.5  BUENOS AIRES      4         2  BUENOS AIRES
2.6  buenos aires      7         2  BUENOS AIRES
2.7  Buenos Aires      3         2  BUENOS AIRES
2.8  BUENOS AIRES  11337         2  BUENOS AIRES
3.9       cordoba      1         3       CORDOBA
3.10      CORDOBA   2297         3       CORDOBA
3.11    CORDOBESA      1         3       CORDOBA
4      CORRIENTES    424         4    CORRIENTES
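Since the question asks for a function, the same steps could be wrapped up roughly as follows. This is only a sketch under assumptions: `correct_by_frequency` and its `max_dist` argument are hypothetical names, not part of the answer's code, and it keeps the answer's choices (uppercase and whitespace-stripped keys, `adist` with unit costs, single-linkage clustering, cut at 4.9). It assigns the cluster ids by position instead of merging on the city name, which may be safer when two rows carry the same spelling.

```r
# Hypothetical helper (not from the original answer): for each spelling,
# return the most frequent spelling found in its edit-distance cluster.
correct_by_frequency <- function(names, counts, max_dist = 4.9) {
  stopifnot(length(names) == length(counts))
  # hclust needs at least two objects; nothing to correct otherwise
  if (length(names) < 2) return(names)
  # same normalisation as the answer: uppercase, whitespace removed
  keys <- gsub("\\s+", "", toupper(names))
  # Levenshtein distances with unit costs (the adist defaults)
  d <- adist(keys)
  hc <- hclust(as.dist(d), method = "single")
  cluster_id <- unname(cutree(hc, h = max_dist))
  # for every row, the position of the most frequent row in its cluster
  best <- ave(seq_along(names), cluster_id,
              FUN = function(i) i[which.max(counts[i])])
  names[best]
}

# usage with the example data from the question
data <- data.frame(
  City = c("AMBA", "BAIRES", "BENOS AIRES", "BUENAS AIRES", "BUENOS AIRES",
           "buenos aires", "Buenos Aires", "BUENOS AIRES",
           "CORDOBA", "cordoba", "CORDOBESA", "CORRIENTES"),
  Occurr = c(29, 1, 1, 1, 4, 7, 3, 11337, 2297, 1, 1, 424),
  stringsAsFactors = FALSE
)
data$CityCorrected <- correct_by_frequency(data$City, data$Occurr)
print(data)
```

To push the corrections back into the full table, something like `ventas2$Province <- data$CityCorrected[match(ventas2$Province, data$City)]` should work, assuming `ventas2$Province` holds exactly the raw spellings listed in `data$City`.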
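Cluster plot: [image not available: the dendrogram drawn by plot(hc), with the rect.hclust cluster rectangles]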
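The information in the dendrogram can also be read numerically, which may help when choosing a cut height for the real data. The sketch below is an illustration rather than part of the original answer: it uses the already normalised spellings from the example, and the `candidate_heights` values are arbitrary assumptions picked to show how the number of clusters changes with the cut.

```r
# Normalised spellings from the example (uppercase, no spaces, duplicates dropped)
cities <- c("AMBA", "BAIRES", "BENOSAIRES", "BUENASAIRES", "BUENOSAIRES",
            "CORDOBA", "CORDOBESA", "CORRIENTES")

# Levenshtein distance matrix with unit costs, labelled for readability
d <- adist(cities)
dimnames(d) <- list(cities, cities)
hc <- hclust(as.dist(d), method = "single")

# Heights at which clusters were merged; a large gap between two
# consecutive heights suggests a natural place to cut the tree
heights <- sort(hc$height)
print(heights)

# Number of clusters obtained for a few candidate cut heights
# (the candidates are arbitrary values chosen for illustration)
candidate_heights <- c(0.5, 1.9, 4.9)
n_clusters <- sapply(candidate_heights,
                     function(h) length(unique(cutree(hc, h = h))))
print(data.frame(cut_height = candidate_heights, n_clusters = n_clusters))
```

With single linkage, two spellings end up in the same cluster as soon as a chain of intermediate spellings connects them, each step within the cut height, so a high cut can merge names that are far apart when compared directly.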