R function to correct words by frequency of more proximate word


Problem Description

I have a table with misspelled words. I need to correct each of them by replacing it with the most similar word, choosing the candidate that has the highest frequency.

For example, when I run

aggregate(CustomerID ~ Province, ventas2, length)

I get

1
2           AMBA         29
3         BAIRES          1
4    BENOS AIRES          1
12  BUENAS AIRES          1
17 BUENOS  AIRES          4
18  buenos aires          7
19  Buenos Aires          3
20  BUENOS AIRES      11337
35       CORDOBA       2297
36       cordoba          1
38     CORDOBESA          1
39    CORRIENTES        424

So I need to replace buenos aires, Buenos Aires, Baires, BENOS AIRES, etc. with BUENOS AIRES, but AMBA shouldn't be replaced. Likewise, CORDOBESA and cordoba should be replaced by CORDOBA, but CORRIENTES should not.

How can I do this in R?

Thanks!

Answer

Here's a possible solution.

Disclaimer:
This code seems to work fine with your current example. I can't guarantee that the current parameters (e.g. cut height, cluster agglomeration method, distance method, etc.) will be valid for your real (complete) data.

# recreating your data
data <- 
read.csv(text=
'City,Occurr
AMBA,29
BAIRES,1
BENOS AIRES,1
BUENAS AIRES,1
BUENOS  AIRES,4
buenos aires,7
Buenos Aires,3
BUENOS AIRES,11337
CORDOBA,2297
cordoba,1
CORDOBESA,1
CORRIENTES,424',stringsAsFactors=F)


# simple pre-processing to city strings:
# - removing spaces
# - turning strings to uppercase
cities <- gsub('\\s+','',toupper(data$City))

# string distance computation
# N.B. here you can play with single components of distance costs 
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign original cities names to distance matrix
rownames(d) <- data$City
# clustering cities
hc <- hclust(as.dist(d),method='single')

# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters) 
# N.B. I decided to cut at distance height < 5
#      (read it as: "I consider equal 2 strings needing
#       less than 5 modifications to pass from one to the other")
#      Obviously you can use another value.
rect.hclust(hc,h=4.9)
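# optional aside: if you are unsure which cut height to use, inspecting the
# sorted merge heights of the dendrogram can help; a large jump between
# consecutive values usually suggests a natural place to cut
sort(round(hc$height, 1))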

# get the clusters ids
clusters <- cutree(hc,h=4.9) 
# turn into data.frame
clusters <- data.frame(City=names(clusters),ClusterId=clusters)

# merge with frequencies
merged <- merge(data,clusters,all.x=T,by='City') 

# add CityCorrected column to the merged data.frame
ret <- by(merged, 
          merged$ClusterId,
          FUN=function(grp){
                idx <- which.max(grp$Occurr)
                grp$CityCorrected <- grp[idx,'City']
                return(grp)
              })

fixed <- do.call(rbind,ret)

Result:

> fixed
              City Occurr ClusterId CityCorrected
1             AMBA     29         1          AMBA
2.2         BAIRES      1         2  BUENOS AIRES
2.3    BENOS AIRES      1         2  BUENOS AIRES
2.4   BUENAS AIRES      1         2  BUENOS AIRES
2.5  BUENOS  AIRES      4         2  BUENOS AIRES
2.6   buenos aires      7         2  BUENOS AIRES
2.7   Buenos Aires      3         2  BUENOS AIRES
2.8   BUENOS AIRES  11337         2  BUENOS AIRES
3.9        cordoba      1         3       CORDOBA
3.10       CORDOBA   2297         3       CORDOBA
3.11     CORDOBESA      1         3       CORDOBA
4       CORRIENTES    424         4    CORRIENTES

The cluster plot:
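If you also want to write the corrections back into your original table, a minimal sketch could look like the one below. It assumes, as in your aggregate call, a data frame named ventas2 with a Province column; the join itself is just one possible way to do the replacement:

# lookup table from raw city name to corrected name
lookup <- fixed[, c('City', 'CityCorrected')]

# replace only the Province values that appear in the lookup, keep the rest unchanged
idx <- match(ventas2$Province, lookup$City)
ventas2$Province <- ifelse(is.na(idx), ventas2$Province, lookup$CityCorrected[idx])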

