检测两个字符串之间的差异 [英] Detect differences between two strings

查看:100
本文介绍了检测两个字符串之间的差异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有2个字符串

string a = "foo bar";
string b = "bar foo";

,我想检测 a 到 b 要从 a 变为 b ,我必须更改哪些字符?

and I want to detect the changes from a to b. What characters do I have to change, to get from a to b?

我认为每个字符都必须进行迭代,并检测它是否被添加,删除或保持相等。所以这是我正确的结果

I think there must be a iteration over each character and detect if it was added, removed or remained equal. So this is my exprected result

'f' Remove
'o' Remove
'o' Remove
' ' Remove
'b' Equal
'a' Equal
'r' Equal
' ' Add
'f' Add
'o' Add
'o' Add

类并为结果枚举:

public enum Operation { Add,Equal,Remove };
public class Difference
{
    public Operation op { get; set; }
    public char c { get; set; }
}

这是我的解决方法,但是 Remove情况不清楚代码的外观

Here is my solution but the "Remove" case is not clear to me how the code has to look like

public static List<Difference> CalculateDifferences(string left, string right)
{
    int count = 0;
    List<Difference> result = new List<Difference>();
    foreach (char ch in left)
    {
        int index = right.IndexOf(ch, count);
        if (index == count)
        {
            count++;
            result.Add(new Difference() { c = ch, op = Operation.Equal });
        }
        else if (index > count)
        {
            string add = right.Substring(count, index - count);
            result.AddRange(add.Select(x => new Difference() { c = x, op = Operation.Add }));
            count += add.Length;
        }
        else
        {
            //Remove?
        }
    }
    return result;
}

已删除字符的代码看起来如何?

How does the code have to look like for removed characters?

更新-添加了更多示例

示例1:

string a = "foobar";
string b = "fooar";

预期结果:

'f' Equal
'o' Equal
'o' Equal
'b' Remove
'a' Equal
'r' Equal

示例2:

string a = "asdfghjk";
string b = "wsedrftr";

预期结果:

'a' Remove
'w' Add
's' Equal
'e' Add
'd' Equal
'r' Add
'f' Equal
'g' Remove
'h' Remove
'j' Remove
'k' Remove
't' Add
'r' Add






更新:

这是Dmitry和ingen的答案之间的比较 https://dotnetfiddle.net/MJQDAO

Here is a comparison between Dmitry's and ingen's answer: https://dotnetfiddle.net/MJQDAO

推荐答案

(最小)编辑距离 / (最小)编辑顺序。您可以在此处找到该过程的理论

You are looking for (minimum) edit distance / (minimum) edit sequence. You can find the theory of the process here:

https://web.stanford.edu/class/cs124/lec/med.pdf

让我们实现(最简单的)Levenstein距离/序列算法(有关详细信息,请参见 https://en.wikipedia.org/ wiki / Levenshtein_distance )。让我们从 helper 类开始(我对您的实现做了一些更改):

Let's implement (simplest) Levenstein Distance / Sequence algorithm (for details see https://en.wikipedia.org/wiki/Levenshtein_distance). Let's start from helper classes (I've changed a bit your implementation of them):

  public enum EditOperationKind : byte {
    None,    // Nothing to do
    Add,     // Add new character
    Edit,    // Edit character into character (including char into itself)
    Remove,  // Delete existing character
  };

  public struct EditOperation {
    public EditOperation(char valueFrom, char valueTo, EditOperationKind operation) {
      ValueFrom = valueFrom;
      ValueTo = valueTo;

      Operation = valueFrom == valueTo ? EditOperationKind.None : operation;
    }

    public char ValueFrom { get; }
    public char ValueTo {get ;}
    public EditOperationKind Operation { get; }

    public override string ToString() {
      switch (Operation) {
        case EditOperationKind.None:
          return $"'{ValueTo}' Equal";
        case EditOperationKind.Add:
          return $"'{ValueTo}' Add";
        case EditOperationKind.Remove:
          return $"'{ValueFrom}' Remove";
        case EditOperationKind.Edit:
          return $"'{ValueFrom}' to '{ValueTo}' Edit";
        default:
          return "???";
      }
    }
  }

据我所知根据提供的示例,我们没有任何 edit 操作,但是 add + remove ;这就是为什么当 insertCost = 1 int removeCost = 1时我放置 editCost = 2 的原因(对于 tie 插入+删除编辑我们将插入并删除)。
现在我们准备实现Levenstein算法:

As far as I can see from the examples provided we don't have any edit operation, but add + remove; that's why I've put editCost = 2 when insertCost = 1, int removeCost = 1 (in case of tie: insert + remove vs. edit we put insert + remove). Now we are ready to implement Levenstein algorithm:

public static EditOperation[] EditSequence(
  string source, string target, 
  int insertCost = 1, int removeCost = 1, int editCost = 2) {

  if (null == source)
    throw new ArgumentNullException("source");
  else if (null == target)
    throw new ArgumentNullException("target");

  // Forward: building score matrix

  // Best operation (among insert, update, delete) to perform 
  EditOperationKind[][] M = Enumerable
    .Range(0, source.Length + 1)
    .Select(line => new EditOperationKind[target.Length + 1])
    .ToArray();

  // Minimum cost so far
  int[][] D = Enumerable
    .Range(0, source.Length + 1)
    .Select(line => new int[target.Length + 1])
    .ToArray();

  // Edge: all removes
  for (int i = 1; i <= source.Length; ++i) {
    M[i][0] = EditOperationKind.Remove;
    D[i][0] = removeCost * i;
  }

  // Edge: all inserts 
  for (int i = 1; i <= target.Length; ++i) {
    M[0][i] = EditOperationKind.Add;
    D[0][i] = insertCost * i;
  }

  // Having fit N - 1, K - 1 characters let's fit N, K
  for (int i = 1; i <= source.Length; ++i)
    for (int j = 1; j <= target.Length; ++j) {
      // here we choose the operation with the least cost
      int insert = D[i][j - 1] + insertCost;
      int delete = D[i - 1][j] + removeCost;
      int edit = D[i - 1][j - 1] + (source[i - 1] == target[j - 1] ? 0 : editCost);

      int min = Math.Min(Math.Min(insert, delete), edit);

      if (min == insert) 
        M[i][j] = EditOperationKind.Add;
      else if (min == delete)
        M[i][j] = EditOperationKind.Remove;
      else if (min == edit)
        M[i][j] = EditOperationKind.Edit;

      D[i][j] = min;
    }

  // Backward: knowing scores (D) and actions (M) let's building edit sequence
  List<EditOperation> result = 
    new List<EditOperation>(source.Length + target.Length);

  for (int x = target.Length, y = source.Length; (x > 0) || (y > 0);) {
    EditOperationKind op = M[y][x];

    if (op == EditOperationKind.Add) {
      x -= 1;
      result.Add(new EditOperation('\0', target[x], op));
    }
    else if (op == EditOperationKind.Remove) {
      y -= 1;
      result.Add(new EditOperation(source[y], '\0', op));
    }
    else if (op == EditOperationKind.Edit) {
      x -= 1;
      y -= 1;
      result.Add(new EditOperation(source[y], target[x], op));
    }
    else // Start of the matching (EditOperationKind.None)
      break;
  }

  result.Reverse();

  return result.ToArray();
}

演示:

var sequence = EditSequence("asdfghjk", "wsedrftr"); 

Console.Write(string.Join(Environment.NewLine, sequence));

结果:

'a' Remove
'w' Add
's' Equal
'e' Add
'd' Equal
'r' Add
'f' Equal
'g' Remove
'h' Remove
'j' Remove
'k' Remove
't' Add
'r' Add

这篇关于检测两个字符串之间的差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆