In a spell checker, how to get words that are 3 edits away (Norvig)


Question

I have been trying to build a spell corrector to fix the addresses in one of my database tables, following the approach at http://norvig.com/spell-correct.html. Using the Address_mast table as the collection of correct strings, I am trying to correct the addresses and write the corrected values back to the "customer_master" table.

Address_mast

ID        Address
1    sonal plaza,harley road,sw-309012
2    rose apartment,kell road, juniper, la-293889
3    plot 16, queen's tower, subbden - 399081
4    cognizant plaza, abs road, ziggar - 500234

In the reference code this is done only for words that are "two edits away from word". I am trying to extend it to 3 or even 4 edits, and at the same time update the corrected words in the other table. Here is the table that contains the misspelled words and is to be updated with the corrected ones:

Customer_master

Address_1

josely apartmt,kell road, juneeper, la-293889
zoonal plaza, harli road,sw-309012
plot 16, queen's tower, subbden - 399081
cognejantt pluza, abs road, triggar - 500234

Here is what I have tried:

import re
import pyodbc
import numpy as np
from collections import Counter

cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=localhost;DATABASE=DBM;UID=ADMIN;PWD=s@123', autocommit=True)
cursor = cnxn.cursor()
cursor.execute("select address as data  from Address_mast")
data=[]
for row in cursor.fetchall():
    data.append(row[0])

data = np.array(data)

def words(text): return re.findall(r'\w+', text.lower())

# Build the vocabulary from the Address_mast strings rather than from a file
WORDS = Counter(words(" ".join(data)))
def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or known(edits3(word)) or known(edits4(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def edits3(word): 
    "All edits that are three edits away from `word`."
    return (e3 for e2 in edits2(word) for e3 in edits1(e2))

def edits4(word): 
    "All edits that are four edits away from `word`."
    return (e4 for e3 in edits3(word) for e4 in edits1(e3))


sqlstr = ""
j=0
k=0
for i in data:
    sqlstr=" update customer_master set Address='"+correction(data)+"' where data="+correction(data)
    cursor.execute(sqlstr)

    j=j+1
    k=k+cursor.rowcount
cnxn.commit()
cursor.close()
cnxn.close()
print(str(k) +" Records Completed")

With this I am unable to get the proper output. Any suggestions on what changes should be made? Thanks in advance.

Answer

The above answers are OK, but there is a faster solution than checking the exponentially growing set of strings at edit distance k. Suppose we store the whole vocabulary in a tree (trie) structure. This is useful because we then know, for example, that we need not search down paths on which no words lie. This is both memory efficient and computationally efficient.

Suppose we have a vocabulary stored in a set, a dict, or ideally a collections.Counter object; we can then set up the data structure as follows:

class VocabTreeNode:
    "A trie node; `word` is set on nodes that terminate a vocabulary word."
    def __init__(self):
        self.children = {}
        self.word = None

    def build(self, vocab):
        "Insert every word of the vocabulary into the trie."
        for w in vocab:
            self.insert(w)

    def insert(self, word):
        "Insert a single word, creating child nodes as needed."
        node = self
        for letter in word:
            if letter not in node.children: 
                node.children[letter] = VocabTreeNode()
            node = node.children[letter]
        node.word = word

To search only for elements within edit distance k of a word, we can give this structure a recursive search:

    def search(self, word, maxCost):
        "Return all (word, distance) pairs within maxCost edits of `word`."
        # First row of the dynamic-programming table: distance from the empty
        # trie path to each prefix of `word`.
        currentRow = range(len(word) + 1)
        results = []
        for letter in self.children:
            self.searchRecursive(self.children[letter], letter,
                                 word, currentRow, results,
                                 maxCost)
        return results

    def searchRecursive(self, node, letter, word, previousRow,
                        results, maxCost):
        # Compute one row of the Levenshtein table for the trie path ending
        # in `letter`, given the parent node's row.
        columns = len(word) + 1
        currentRow = [previousRow[0] + 1]
        for column in range(1, columns):
            insertCost = currentRow[column - 1] + 1
            deleteCost = previousRow[column] + 1
            if word[column - 1] != letter:
                replaceCost = previousRow[column - 1] + 1
            else:
                replaceCost = previousRow[column - 1]
            currentRow.append(min(insertCost, deleteCost, replaceCost))

        # The last entry is the distance from this path to the full word.
        if currentRow[-1] <= maxCost and node.word != None:
            results.append((node.word, currentRow[-1]))
        # Only descend further if some prefix is still within budget.
        if min(currentRow) <= maxCost:
            for next_letter in node.children:
                self.searchRecursive(node.children[next_letter], next_letter, word,
                                     currentRow, results, maxCost)

There is just one problem that I am not sure how to overcome: transpositions are not valid as single steps along a trie path, so I am not sure how to incorporate a transposition as edit distance 1 without a somewhat complicated hack. A rough workaround is sketched below.
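One possible workaround, not part of the original answer, is to compute the restricted Damerau-Levenshtein distance instead, so that an adjacent transposition counts as one edit. The recursion only needs to carry the row computed two levels up plus the previous letter on the path. A rough sketch, assuming it is added to the same VocabTreeNode class as above (searchDL and searchRecursiveDL are hypothetical names):

    def searchDL(self, word, maxCost):
        "Like search(), but counts an adjacent transposition as one edit (sketch)."
        currentRow = range(len(word) + 1)
        results = []
        for letter in self.children:
            self.searchRecursiveDL(self.children[letter], letter, None, word,
                                   currentRow, None, results, maxCost)
        return results

    def searchRecursiveDL(self, node, letter, prevLetter, word,
                          previousRow, prePreviousRow, results, maxCost):
        # Same row computation as searchRecursive, plus a transposition term
        # taken from the row two levels up (restricted Damerau-Levenshtein).
        columns = len(word) + 1
        currentRow = [previousRow[0] + 1]
        for column in range(1, columns):
            insertCost = currentRow[column - 1] + 1
            deleteCost = previousRow[column] + 1
            replaceCost = previousRow[column - 1] + (word[column - 1] != letter)
            cost = min(insertCost, deleteCost, replaceCost)
            # Transposition: the last two letters on the trie path equal the
            # last two letters of the word prefix, swapped.
            if (prePreviousRow is not None and column > 1
                    and letter == word[column - 2]
                    and prevLetter == word[column - 1]):
                cost = min(cost, prePreviousRow[column - 2] + 1)
            currentRow.append(cost)

        if currentRow[-1] <= maxCost and node.word is not None:
            results.append((node.word, currentRow[-1]))
        if min(currentRow) <= maxCost:
            for next_letter in node.children:
                self.searchRecursiveDL(node.children[next_letter], next_letter,
                                       letter, word, currentRow, previousRow,
                                       results, maxCost)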

My corpus was 97,722 words (the word list that ships with almost any Linux distro).
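For reference, the timing below assumes the tree has already been built once from the vocabulary. A minimal setup might look like this; the names WORDS and V and the dictionary path are assumptions, not shown in the original answer:

import re
from collections import Counter

# Assumed setup: build the vocabulary from the system word list, then the trie.
WORDS = Counter(re.findall(r'\w+', open('/usr/share/dict/words').read().lower()))
V = VocabTreeNode()
V.build(WORDS)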

from time import sleep, time

sleep(1)
start = time()

for i in range(100):
    x = V.search('elephant', 3)

print(time() - start)

>>> 17.5 

That works out to roughly 0.175 seconds per edit-distance-3 search for this word. An edit-distance-4 search took about 0.377 seconds, whereas chaining edits1 to build consecutive edit distances will quickly run your system out of memory.

With the caveat that transpositions are not handled easily, this is a fast and effective way of implementing a Norvig-style corrector for larger edit distances.
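To tie this back to the question, one could rank the candidates returned by search with Norvig's word probabilities and apply that correction to every token of an address. A rough sketch under those assumptions (correction_tree and correct_address are hypothetical helpers; WORDS and V are assumed built as above):

import re

def correction_tree(word, V, WORDS, maxCost=3):
    "Most probable vocabulary word within maxCost edits of `word` (sketch)."
    candidates = V.search(word, maxCost)          # list of (word, distance) pairs
    if not candidates:
        return word
    # Prefer the smallest edit distance, then the highest corpus frequency.
    best, _ = min(candidates, key=lambda wd: (wd[1], -WORDS[wd[0]]))
    return best

def correct_address(addr, V, WORDS, maxCost=3):
    "Correct each alphabetic token of an address, leaving punctuation alone."
    return re.sub(r'[a-z]+',
                  lambda m: correction_tree(m.group(0), V, WORDS, maxCost),
                  addr.lower())

Each corrected address could then be written back to Customer_master with a parameterised UPDATE, as in the question's loop.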

