Generate misspelled words (typos)
Question
I have implemented a fuzzy matching algorithm, and I would like to evaluate its recall using sample queries against test data.
Let's say I have a document containing the text:
{"text": "The quick brown fox jumps over the lazy dog"}
I want to see whether I can retrieve it with queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".
In other words, I want to add noise to strings to generate misspelled words (typos).
What is a way to automatically generate misspelled words for evaluating fuzzy search?
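To make the recall measurement concrete, here is a minimal sketch of how it could be set up. Everything in it is an assumption for illustration: `documents`, `search`, `recall`, and `test_cases` are hypothetical names, and `difflib.get_close_matches` from the standard library only stands in for the real fuzzy matcher, which the question does not show.

```python
import difflib

# Toy corpus: document id -> text (hypothetical test data)
documents = {1: "The quick brown fox jumps over the lazy dog"}


def search(query, cutoff=0.6):
    """Placeholder matcher: ids of documents in which every query word
    has at least one close match among the document's words."""
    hits = set()
    for doc_id, text in documents.items():
        words = text.lower().split()
        if all(
            difflib.get_close_matches(q, words, n=1, cutoff=cutoff)
            for q in query.lower().split()
        ):
            hits.add(doc_id)
    return hits


def recall(test_cases):
    """Fraction of (query, expected_id) pairs whose expected document is retrieved."""
    if not test_cases:
        return 0.0
    found = sum(1 for query, expected in test_cases if expected in search(query))
    return found / len(test_cases)


# Misspelled queries paired with the document each one should retrieve
test_cases = [("sox", 1), ("hazy drog", 1)]
print(recall(test_cases))
```

With a setup like this, the open part is producing many `test_cases` automatically instead of writing the misspelled queries by hand.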
Recommended answer
I would just write a program that randomly alters letters in your words. You can elaborate on it for the specific requirements of your case, but the general idea goes like this.
Say you have a phrase:
phrase = "The quick brown fox jumps over the lazy dog"
Then define the probability that a word gets changed (say 10%):
p = 0.1
Then loop over the words of your phrase and draw a sample from a uniform distribution for each one. If the sample is at or below your threshold, randomly change one letter of the word:
import random
import string

new_phrase = []
words = phrase.split(' ')
for word in words:
    outcome = random.random()
    # Skip empty strings, which split(' ') yields for consecutive spaces
    if word and outcome <= p:
        # Pick one position and replace its letter with a random ASCII letter
        ix = random.randrange(len(word))
        new_word = word[:ix] + random.choice(string.ascii_letters) + word[ix + 1:]
        new_phrase.append(new_word)
    else:
        new_phrase.append(word)
new_phrase = ' '.join(new_phrase)
In my case, I got the following result (it varies from run to run, since the changes are random):
print(new_phrase)
# e.g. The quick brown fWx jumps ovey the lazy dog
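The code above only substitutes letters, while a query like "drog" for "dog" needs an inserted letter. As one possible elaboration, here is a sketch that picks among four single-character edits (substitute, insert, delete, transpose) and accepts a seed so that a test set can be reproduced. The function names `make_typo` and `add_noise` and the choice of these four edits are my own assumptions, not something the question requires.

```python
import random
import string


def make_typo(word, rng):
    """Return `word` with one random edit applied.

    The result can equal the input, e.g. when the substituted letter
    happens to be the original one or two equal letters are swapped.
    """
    if not word:
        return word
    op = rng.choice(["substitute", "insert", "delete", "transpose"])
    letter = rng.choice(string.ascii_lowercase)
    i = rng.randrange(len(word))
    if op == "substitute":
        return word[:i] + letter + word[i + 1:]
    if op == "insert":
        i = rng.randrange(len(word) + 1)  # also allow appending at the end
        return word[:i] + letter + word[i:]
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "transpose" and len(word) > 1:
        i = min(i, len(word) - 2)  # swap the letters at i and i + 1
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word  # one-letter word: delete and transpose are skipped


def add_noise(phrase, p=0.1, seed=None):
    """Apply make_typo to each word of `phrase` with probability p."""
    rng = random.Random(seed)
    return " ".join(
        make_typo(w, rng) if rng.random() < p else w
        for w in phrase.split()
    )


phrase = "The quick brown fox jumps over the lazy dog"
print(add_noise(phrase, p=0.3, seed=42))
```

Passing the same `seed` gives the same noisy phrase every time, which helps when comparing recall across versions of the matcher. Raising `p`, or calling `make_typo` more than once per word, produces harder queries.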