Generate misspelled words (typos)

Question

I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.

Let's say I have a document containing the text:

{"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".

In other words, I want to add noise to strings to generate misspelled words (typos).

What is a way to automatically generate misspelled words for evaluating a fuzzy search?

Recommended answer

I would just write a program that randomly alters letters in your words. You can adapt it to the specific requirements of your case, but the general idea goes like this.

Say you have a phrase:

phrase = "The quick brown fox jumps over the lazy dog"

Then define the probability that a word gets changed (say 10%):

p = 0.1

Then loop over the words of your phrase and, for each one, draw a sample from a uniform distribution on [0, 1). If the sample is at or below your threshold, randomly change one letter in the word:

import random
import string

new_words = []
# split() without arguments skips empty strings caused by repeated spaces
for word in phrase.split():
    # With probability p, replace one randomly chosen letter of the word
    if random.random() <= p:
        ix = random.randrange(len(word))
        # Exclude the current letter so the word really changes
        letter = random.choice([c for c in string.ascii_letters if c != word[ix]])
        word = word[:ix] + letter + word[ix + 1:]
    new_words.append(word)

new_phrase = ' '.join(new_words)

In my case, I got the following result:

print(new_phrase)
# The quick brown fWx jumps ovey the lazy dog
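To tie this back to the goal of measuring recall, here is a rough sketch of how the generated typos could be fed into an evaluation loop. Everything in it is an assumption for illustration, not part of the answer above: `make_typo` widens the idea from substitution only to the four usual single-character edits (substitute, insert, delete, transpose), `difflib_search` is a stand-in built on the standard library's `difflib.get_close_matches` that you would replace with your own fuzzy matcher, and `recall` and the sample `queries` are hypothetical names and data.

```python
import difflib
import random
import string


def make_typo(word, rng):
    """Apply one random single-character edit that always changes `word`."""
    if not word:
        return word
    # Positions where swapping neighbours actually changes the word.
    swappable = [i for i in range(len(word) - 1) if word[i] != word[i + 1]]
    ops = ["substitute", "insert"]
    if len(word) > 1:
        ops.append("delete")  # never delete the only character
    if swappable:
        ops.append("transpose")
    op = rng.choice(ops)
    if op == "substitute":
        i = rng.randrange(len(word))
        letter = rng.choice([c for c in string.ascii_lowercase if c != word[i]])
        return word[:i] + letter + word[i + 1:]
    if op == "insert":
        i = rng.randrange(len(word) + 1)
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if op == "delete":
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    i = rng.choice(swappable)  # transpose two neighbouring characters
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def difflib_search(query, documents, cutoff=0.6):
    """Stand-in fuzzy matcher (an assumption, swap in your own): a document
    matches when every query word has a close match among its words."""
    results = []
    for doc in documents:
        doc_words = doc.lower().split()
        if all(difflib.get_close_matches(w.lower(), doc_words, n=1, cutoff=cutoff)
               for w in query.split()):
            results.append(doc)
    return results


def recall(search, documents, queries, rng):
    """Fraction of noisy queries for which `search` still returns the
    document the clean query was taken from."""
    hits = 0
    for doc, query in queries:
        noisy = " ".join(make_typo(w, rng) for w in query.split())
        if doc in search(noisy, documents):
            hits += 1
    return hits / len(queries)


documents = ["The quick brown fox jumps over the lazy dog"]
# (expected document, clean query) pairs; hypothetical test data
queries = [(documents[0], "fox"),
           (documents[0], "lazy dog"),
           (documents[0], "quick brown")]

rng = random.Random(0)  # fixed seed so an evaluation run can be repeated
print(recall(difflib_search, documents, queries, rng))
```

Passing a seeded `random.Random` instance instead of using the module-level functions keeps the noisy query set reproducible, which matters if you want to compare recall across different versions of the matcher.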
