How to generate a set of similar strings in Python

Question

I am wondering how to generate a set of similar strings based on Levenshtein distance (string edit distance). Ideally, I would like to pass in a source string (i.e. the string used to generate other strings that are similar to it), the number of strings to generate, and a threshold as parameters, so that the similarity among the strings in the generated set is greater than the threshold. What Python package(s) should I use to achieve that? Or any ideas on how to implement this?

Recommended answer

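I think you can approach the problem the other way around (in reverse): instead of searching for strings and measuring their distance to the source, start from the source and apply edits to it.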

  • Given a string, say it is sittin.
  • Given a threshold (edit distance), say it is k.
  • Then you apply combinations of different "edits" in at most k steps.

For example, let's say k = 2, and assume the allowed edit modes are:

  • delete one character
  • add one character
  • substitute one character for another
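The three edit modes can be sketched as small Python helpers. This is only an illustration, not code from the original answer: the function names (delete_char, add_char, substitute_char), the optional rng parameter, and the choice of lowercase ASCII letters as the alphabet are all assumptions.

```python
import random
import string

# Assumption: new characters are drawn from lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def delete_char(s, rng=random):
    """Remove one randomly chosen character (no-op on an empty string)."""
    if not s:
        return s
    i = rng.randrange(len(s))
    return s[:i] + s[i + 1:]


def add_char(s, rng=random):
    """Insert one random character at a random position."""
    i = rng.randint(0, len(s))  # randint is inclusive, so appending is possible
    return s[:i] + rng.choice(ALPHABET) + s[i:]


def substitute_char(s, rng=random):
    """Replace one character with a different one (no-op on an empty string)."""
    if not s:
        return s
    i = rng.randrange(len(s))
    replacement = rng.choice([c for c in ALPHABET if c != s[i]])
    return s[:i] + replacement + s[i + 1:]
```

Each helper changes the string by exactly one edit, so applying one of them moves the result at most one unit of Levenshtein distance away from its input.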

Then the logic is as follows (pseudocode):

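source = 'sittin'
for num in 1 .. n:  # suppose you want to have n strings generated
  candidate = source
  # the edit distance should be smaller than or equal to k,
  # but greater than or equal to one
  for i in 1 .. randint(1, k):
    pick a random edit mode from (delete, add, substitute)
    apply it to candidate and update candidate
  # edits can cancel each other out, so the actual distance may be
  # smaller than the number of edits applied
  keep candidate if it differs from source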
If you need to stick with a pre-defined dictionary, that adds some complexity but it is still doable. In this case, the edit must be valid, i.e. its result must be a string that appears in the dictionary.
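One possible reading of the dictionary constraint is that every edit has to land on a word in the dictionary. Under that assumption, a sketch could enumerate all single edits of a word and keep only the valid ones, repeating for up to k steps. The names single_edits and valid_neighbours, the vocabulary set, and the lowercase ASCII alphabet are hypothetical and not part of the original answer.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase ASCII alphabet


def single_edits(word):
    """All strings reachable from word by one delete, add, or substitute."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    adds = {left + c + right for left, right in splits for c in ALPHABET}
    subs = {left + c + right[1:] for left, right in splits if right
            for c in ALPHABET}
    return (deletes | adds | subs) - {word}


def valid_neighbours(source, vocabulary, k):
    """Words reachable from source in at most k edits, each landing in vocabulary."""
    frontier, found = {source}, set()
    for _ in range(k):
        frontier = {w for s in frontier for w in single_edits(s)
                    if w in vocabulary and w not in found and w != source}
        found |= frontier
    return found


# Hypothetical toy dictionary, for illustration only.
vocabulary = {"sitting", "fitting", "sitter", "kitten"}
print(sorted(valid_neighbours("sittin", vocabulary, 2)))
# → ['fitting', 'sitting']
```

Note that "kitten" is two edits away from "sittin" but is not returned, because neither intermediate string ("kittin" or "sitten") is in the toy dictionary. If only the final string needs to be valid, filtering the dictionary by edit distance to the source would be a simpler alternative.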
