How to use the hunspell package to suggest correct words in a column in R?


Question

I'm currently working with a large data frame containing lots of text in each row and would like to effectively identify and replace misspelled words in each sentence with the hunspell package. I was able to identify the misspelled words, but can't figure out how to do hunspell_suggest on a list.

Here is a sample of the data frame:

df1 <- data.frame("Index" = 1:7, "Text" = c("A complec sentence joins an independet",
                                            "Mary and Samantha arived at the bus staton before noon",
                                            "I did not see thm at the station in the mrning",
                                            "The participnts read 60 sentences in radom order",
                                            "how to fix mispelled words in R languge",
                                            "today is Tuesday",
                                            "bing sports quiz"))

I converted the text column into character and used hunspell to identify the misspelled words within each row.

library(hunspell)
df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)

I tried:

df1$suggest <- hunspell_suggest(df1$word_check)

but it keeps showing this error:

Error in hunspell_suggest(df1$word_check) : 
  is.character(words) is not TRUE

I'm new to this, so I'm not sure what the suggest column produced by hunspell_suggest should look like. Any help would be greatly appreciated.

Recommended answer

Check your intermediate steps. The output of df1$word_check is as follows:

List of 5
 $ : chr [1:2] "complec" "independet"
 $ : chr [1:2] "arived" "staton"
 $ : chr [1:2] "thm" "mrning"
 $ : chr [1:2] "participnts" "radom"
 $ : chr [1:2] "mispelled" "languge"

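which is of type list, whereas hunspell_suggest expects a character vector (hence the is.character(words) error). Calling lapply(df1$word_check, hunspell_suggest) applies the function to each row's misspelled words and returns the suggestions.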
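The lapply approach can be sketched on a shortened version of the sample data. This is a minimal illustration, assuming the hunspell package and its default en_US dictionary are installed; the first_suggest column is a hypothetical addition for convenience, not part of the original question.

```r
library(hunspell)

# Shortened copy of the sample data from the question
df1 <- data.frame(
  Index = 1:3,
  Text = c("A complec sentence joins an independet",
           "how to fix mispelled words in R languge",
           "today is Tuesday"),
  stringsAsFactors = FALSE
)

# hunspell() returns a list: one character vector of flagged words per row
df1$word_check <- hunspell(df1$Text)

# hunspell_suggest() needs a character vector, so call it once per list element
df1$suggest <- lapply(df1$word_check, hunspell_suggest)

# Hypothetical extra column: keep only the first suggestion for each flagged word
df1$first_suggest <- lapply(df1$suggest, function(s) {
  vapply(s, function(w) if (length(w) > 0) w[1] else NA_character_, character(1))
})
```

Each element of df1$suggest is itself a list with one character vector of candidates per flagged word, so rows without misspellings simply hold an empty list.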

Edit

I decided to go into more detail on this question as I have not seen any easy alternative. This is what I have come up with:

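library(hunspell)

# Replace each misspelled word with hunspell's first suggestion
cleantext <- function(x) {
  vapply(x, function(s) {
    bad <- hunspell(s)[[1]]
    for (i in seq_along(bad)) {
      sugg <- hunspell_suggest(bad[i])[[1]]
      # Leave the word unchanged when hunspell offers no suggestion
      if (length(sugg) > 0) {
        # \\b restricts the match to whole words, not substrings of other words
        s <- gsub(paste0("\\b", bad[i], "\\b"), sugg[1], s)
      }
    }
    s
  }, character(1), USE.NAMES = FALSE)
}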

Although there probably is a more elegant way of doing it, this function returns a vector of character strings corrected as such:

> df1$Text
[1] "A complec sentence joins an independet"                
[2] "Mary and Samantha arived at the bus staton before noon"
[3] "I did not see thm at the station in the mrning"        
[4] "The participnts read 60 sentences in radom order"      
[5] "how to fix mispelled words in R languge"               
[6] "today is Tuesday"                                      
[7] "bing sports quiz" 

> cleantext(df1$Text)
[1] "A complex sentence joins an independent"               
[2] "Mary and Samantha rived at the bus station before noon"
[3] "I did not see them at the station in the morning"      
[4] "The participants read 60 sentences in radon order"     
[5] "how to fix misspelled words in R language"             
[6] "today is Tuesday"                                      
[7] "bung sports quiz" 

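Watch out: this uses the first suggestion given by hunspell, which may or may not be correct. In the output above, for example, "arived" became "rived" and "radom" became "radon" instead of the intended "arrived" and "random", and "bing" was changed to "bung".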
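One way to guard against wrong first suggestions is to review the candidates by hand and apply only the approved fixes. The sketch below uses base R only; the corrections vector and the apply_corrections helper are hypothetical names introduced for illustration and are not part of hunspell.

```r
# Hypothetical, manually reviewed map: names are misspellings, values are fixes
corrections <- c(arived = "arrived", staton = "station", radom = "random")

# Replace whole words only, one reviewed misspelling at a time
apply_corrections <- function(text, corrections) {
  for (bad in names(corrections)) {
    pattern <- paste0("\\b", bad, "\\b")
    text <- gsub(pattern, corrections[[bad]], text)
  }
  text
}

fixed <- apply_corrections(
  "Mary and Samantha arived at the bus staton before noon",
  corrections
)
print(fixed)
```

Because gsub is vectorized over its text argument, the same helper can be applied to an entire column such as df1$Text in one call.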
