Generate misspelled words (typos)

Views: 57 Published: 2021/6/7 20:37:30 Tags: python nlp fuzzy-search

Question

I have implemented a fuzzy matching algorithm and I would like to evaluate its recall using some sample queries with test data.

Let's say I have a document containing the text:

{"text": "The quick brown fox jumps over the lazy dog"}

I want to see if I can retrieve it by testing queries such as "sox" or "hazy drog" instead of "fox" and "lazy dog".

In other words, I want to add noise to strings to generate misspelled words (typos).

What is a way to automatically generate misspelled words for evaluating a fuzzy search?

Recommended answer

I would just write a program that randomly alters letters in your words. You can adapt it to the specific requirements of your case, but the general idea goes like this.

Say you have a phrase

phrase = "The quick brown fox jumps over the lazy dog"

Then define a probability for a word to change (say 10%)

p = 0.1

Then loop over the words of your phrase and, for each one, draw a sample from a uniform distribution on [0, 1). If the sample is at or below your threshold, randomly change one letter in the word.

import random
import string

new_words = []
# split() with no argument skips repeated whitespace, so no word is empty
for word in phrase.split():
    # Alter the word only when the uniform draw falls at or below p
    if random.random() <= p:
        ix = random.randrange(len(word))
        # Replace the letter at position ix with a random ASCII letter
        # (the same letter may be drawn again, leaving the word unchanged)
        word = word[:ix] + random.choice(string.ascii_letters) + word[ix + 1:]
    new_words.append(word)

new_phrase = ' '.join(new_words)

In my case, I got the following result

print(new_phrase)
# The quick brown fWx jumps ovey the lazy dog
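
If substitution alone is too narrow for your case, the same idea can be extended to other single-character edits and tied back to the recall measurement from the question. The following is only a sketch under some assumptions: the helper names add_typo, add_typos and recall are made up for illustration, the four edit types (substitute, insert, delete, transpose) are my choice rather than anything the question requires, and search stands for your own fuzzy search, assumed here to take a query string and return a collection of document ids.

```python
import random
import string


def add_typo(word, rng):
    """Apply one random single-character edit to word (hypothetical helper)."""
    if not word:
        return word
    ops = ["substitute", "insert"]
    if len(word) > 1:
        # Deleting from or transposing a one-letter word makes no sense
        ops += ["delete", "transpose"]
    op = rng.choice(ops)
    i = rng.randrange(len(word))
    letter = rng.choice(string.ascii_lowercase)
    if op == "substitute":
        # May draw the original letter again and leave the word unchanged
        return word[:i] + letter + word[i + 1:]
    if op == "insert":
        return word[:i] + letter + word[i:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    # transpose: swap the characters at positions j and j + 1
    j = rng.randrange(len(word) - 1)
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def add_typos(phrase, p, rng):
    """Corrupt each word of phrase independently with probability p."""
    return " ".join(
        add_typo(w, rng) if rng.random() < p else w for w in phrase.split()
    )


def recall(search, queries, expected_id):
    """Fraction of queries whose results contain expected_id.

    search is an assumed callable: query string -> collection of doc ids.
    """
    hits = sum(1 for q in queries if expected_id in search(q))
    return hits / len(queries)


# Seeded generator so the noisy queries are reproducible between runs
rng = random.Random(0)
phrase = "The quick brown fox jumps over the lazy dog"
queries = [add_typos(phrase, 0.3, rng) for _ in range(5)]
```

Passing an explicit random.Random instance instead of using the module-level functions keeps the test queries repeatable, which helps when comparing recall across versions of the matcher.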
