Edit distance: Ignore start/end


Question

I am looking for an algorithm that computes edit distance but ignores whitespace and any leading or trailing text in one of the two strings:

edit("four","foor") = 1
edit("four","noise fo or blur") = 1

Is there an existing algorithm for this? Perhaps even a Perl or Python library?

Recommended answer

The code to do this is simple in concept. The part you add yourself is the transformation that decides what to ignore:

#!perl
use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);

use Text::Levenshtein qw(distance);

say edit( "four", "foor" );
say edit( "four", "noise fo or blur" );

sub edit ( $start, $target ) {
    # transform strings to ignore what you want
    # ...
    distance( $start, $target )
    }

Maybe you want to check all substrings of the target that have the same length as the search string:

use v5.22;
use feature qw(signatures);
no warnings qw(experimental::signatures);

use Text::Levenshtein qw(distance);

say edit( "four", "foar" );
say edit( "four", "noise fo or blur" );

sub edit ( $start, $target ) {
    my $start_length = length $start;
    $target =~ s/\s+//g;
    # every window of $start_length characters in the squeezed target
    my @all_n_chars = map {
        substr $target, $_, $start_length
        } 0 .. ( length($target) - $start_length );

    my $closest;
    my $closest_distance = $start_length + 1;
    foreach ( @all_n_chars ) {
        my $distance = distance( $start, $_ );
        if( $distance < $closest_distance ) {
            $closest = $_;
            $closest_distance = $distance;
            say "closest: $closest Distance: $distance";
            last if $distance == 0;
            }
        }

    return $closest_distance;
    }

This very simple-minded implementation finds what you want. Be aware, however, that some unrelated substring might happen to have a lower edit distance than the match you intended.

closest: foar Distance: 1
1
closest: nois Distance: 3
closest: foor Distance: 1
1
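Same-length windows only line up with substitutions; a match that is one character shorter or longer than the search string is scored against a window that includes a neighbouring character. A different technique from the one in this answer, usually described as approximate substring matching (Sellers' variant of the edit-distance table), lets the match start and end anywhere in the target. Below is a minimal Python sketch of that idea; the function name `substring_edit_distance` is made up for illustration, and stripping whitespace first is an assumption carried over from the `s/\s+//g` step above.

```python
def substring_edit_distance(query: str, text: str) -> int:
    """Smallest edit distance between query and any substring of text."""
    # Assumption: whitespace in the text is ignored, as in the Perl code.
    text = "".join(text.split())

    # prev[i] = cheapest way to turn some substring of the text read so far
    # (ending at the current position) into the first i characters of query.
    prev = list(range(len(query) + 1))
    best = prev[-1]  # cost of matching query against the empty substring

    for ch in text:
        cur = [0]  # a match may start at any position, so row 0 costs nothing
        for i, qc in enumerate(query, start=1):
            cur.append(min(
                prev[i] + 1,               # ch is an extra character in the text
                cur[i - 1] + 1,            # qc is missing from the text
                prev[i - 1] + (qc != ch),  # match or substitution
            ))
        best = min(best, cur[-1])  # a match may also end at any position
        prev = cur
    return best


print(substring_edit_distance("four", "foor"))              # 1
print(substring_edit_distance("four", "noise fo or blur"))  # 1
print(substring_edit_distance("four", "noise fur blur"))    # 1
```

In the last call the best match is "fur", which is shorter than the search string, so every four-character window would score 2.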

You could extend the Perl program above to remember the true starting position of each substring, so you can find the match again in the original string, but this should be enough to get you started. If you wanted to use Python, I think the program would look very similar.
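As a rough illustration of both suggestions, here is a Python sketch of the same window search that also records where the best window sits in the original, unsqueezed target. It is not a port of any existing library: `levenshtein` and `closest_window` are hypothetical helper names, and the plain dynamic-programming distance stands in for `Text::Levenshtein`.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def closest_window(query: str, target: str):
    """Return (distance, start, end) for the best same-length window.

    start and end are offsets into the original target, so
    target[start:end] is the matched text with its whitespace intact.
    Returns None when the query is empty or longer than the target.
    """
    # Keep every non-whitespace character together with its original offset.
    kept = [(i, ch) for i, ch in enumerate(target) if not ch.isspace()]
    squeezed = "".join(ch for _, ch in kept)
    n = len(query)
    if n == 0 or n > len(squeezed):
        return None

    best = None
    for k in range(len(squeezed) - n + 1):
        d = levenshtein(query, squeezed[k:k + n])
        if best is None or d < best[0]:
            best = (d, kept[k][0], kept[k + n - 1][0] + 1)
            if d == 0:
                break  # an exact match cannot be improved
    return best


target = "noise fo or blur"
distance, start, end = closest_window("four", target)
print(distance, repr(target[start:end]))  # 1 'fo or'
```

Keeping the original offset next to each surviving character is what makes it possible to slice the match back out of the untouched string.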
