Machine learning algorithm for spelling check


Question

I have a list of medicine names (regular_list) and a list of new names (new_list). I want to check whether the names in new_list are already present in regular_list. The issue is that the names in new_list may contain typos, and I want those names to be treated as matches to the regular list. I know that stringdist can solve this problem, but I need a machine learning algorithm.

Recommended answer

As was already mentioned here ("machine learning to overcome typo errors"), machine learning tools are overkill for such a task, but the simplest possibility would be to combine the two approaches (string distance and machine learning).

On the one hand, you can compute the edit distance between a given word x and each of the dictionary words d_i. Additionally, you can train a per-word classifier

c(d_i, distance(x,d_i)) 

returning True (class 1) if a given edit distance has been learned to be sufficient to consider x a misspelled version of d_i. This can give you a more general model than not using machine learning, as you can have a different threshold for each dictionary word (some words are misspelled more often than others), but obviously you have to prepare a training set in the form (misspelled_word, correct_one) (and also add (correct_one, correct_one)).
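A minimal sketch of this idea in Python, under stated assumptions: the "classifier" here is the simplest possible learned rule, a per-word threshold set to the largest edit distance seen among that word's known misspellings. The function names (edit_distance, train_thresholds, match) and the sample medicine names are illustrative and not part of the original answer.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def train_thresholds(pairs):
    """Learn one threshold per dictionary word: the largest edit distance
    observed among its training examples (assumed rule, not the only option).
    pairs holds (misspelled_word, correct_one), including (correct_one, correct_one)."""
    thresholds = defaultdict(int)
    for observed, correct in pairs:
        thresholds[correct] = max(thresholds[correct],
                                  edit_distance(observed, correct))
    return dict(thresholds)


def match(x, thresholds):
    """Return the closest d_i for which c(d_i, distance(x, d_i)) is True, else None."""
    best = None
    for d_i, limit in thresholds.items():
        dist = edit_distance(x, d_i)
        if dist <= limit and (best is None or dist < best[0]):
            best = (dist, d_i)
    return best[1] if best else None


# Hypothetical training set, for illustration only.
training_pairs = [
    ("ibuprofen", "ibuprofen"), ("ibuprofin", "ibuprofen"), ("ibuprophen", "ibuprofen"),
    ("aspirin", "aspirin"), ("asprin", "aspirin"),
]
thresholds = train_thresholds(training_pairs)
for name in ["ibuprofan", "paracetamol"]:
    print(name, "->", match(name, thresholds))
```

With this data the learned threshold differs per word, which is the point of the per-word model: a single global cutoff could not accept two edits for one name while allowing only one for another.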

You can use any type of binary classifier for such a task, as long as it can work on real-valued ("real") input data.
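As one possible instance of such a classifier, the sketch below fits a one-feature logistic regression on edit distances for a single dictionary word, using plain gradient descent. Several things here are assumptions rather than part of the original answer: the choice of logistic regression, the names train_logistic and is_typo, the hyperparameters, and the sample data. In particular, the negative examples (label 0, e.g. distances from unrelated names) are an addition, since a probabilistic classifier needs both classes.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, epochs=2000, lr=0.1):
    """Fit p(typo | distance) = sigmoid(w * distance + b) by batch gradient descent.
    samples holds (distance, label) pairs: label 1 for a typo of d_i, 0 otherwise."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for dist, label in samples:
            error = sigmoid(w * dist + b) - label  # gradient of the log loss
            grad_w += error * dist
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def is_typo(dist, model) -> bool:
    """c(d_i, distance(x, d_i)) for the word this model was trained on."""
    w, b = model
    return sigmoid(w * dist + b) >= 0.5


# Hypothetical (distance, label) samples for ONE dictionary word d_i;
# the distances are assumed to be precomputed with any edit distance function.
samples = [(0, 1), (1, 1), (1, 1), (2, 1), (4, 0), (5, 0), (6, 0), (7, 0)]
model = train_logistic(samples)
print(is_typo(1, model), is_typo(6, model))
```

One model would be trained per dictionary word, so each word ends up with its own decision boundary on the distance axis.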
