Compare Strings for Percentage Match


Problem description

Hello all,

OK, I have been banging my head against the wall for a while now, trying different techniques. None of them works well.

I have two strings. I need to compare them and get an exact percentage of match,

e.g. "four score and seven years ago" compared to "for scor and sevn yeres ago"

Well, I first started by comparing every word to every word, tracking every hit, with percentage = count \ numOfWords. Nope, that didn't take misspelled words into account. ("four" <> "for" even though it is close.)
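To make the limitation of the word-by-word approach concrete, here is a minimal sketch in Python (used for brevity; the thread's own code is VB.NET). The function name `word_match_ratio` and the choice to lowercase and split on whitespace are assumptions for illustration, not the poster's actual code.

```python
def word_match_ratio(reference: str, candidate: str) -> float:
    """Fraction of words in `reference` that appear exactly in `candidate`."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    # Only exact word matches count, so a one-letter typo scores zero for that word.
    hits = sum(1 for word in ref_words if word in cand_words)
    return hits / len(ref_words)


ratio = word_match_ratio("four score and seven years ago",
                         "for scor and sevn yeres ago")
print(f"{ratio:.0%}")  # only "and" and "ago" match exactly: 2 of 6 words
```

Because each word is all-or-nothing, four lightly misspelled words drag the score down to one third even though the strings are visibly close.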

Then I tried comparing the strings character by character, advancing the index into one string when the characters did not match (to account for misspellings). But I would get false hits, because the first string could contain every char of the second, just not in the exact order of the second. ("stuff avail" <> "stu vail", but it would come back as a hit, albeit with a low percentage: 9 \ 11 = 81%.)

So I then tried comparing PAIRS of chars in each string. If string1[i] = string2[k] AND string1[i+1] = string2[k+1], increment the count, and increment "k" when they don't match (to track misspellings; "for" and "four" should come back with a 75% hit). That doesn't seem to work either. It is getting closer, but even with an exact match it only returns 94%. And it really gets screwed up when something is badly misspelled. (Code at the bottom.)

Any ideas or directions to go?

Thanks,

Josh


count = 0
j = 0
k = 0
While j < strTempName.Length - 2 And k < strTempFile.Length - 2
    ' To ignore non letters or digits '
    If Not strTempName(j).IsLetter(strTempName(j)) Then
        j += 1
    End If

    ' To ignore non letters or digits '
    If Not strTempFile(k).IsLetter(strTempFile(k)) Then
        k += 1
    End If

    ' compare pair of chars '
    While (strTempName(j) <> strTempFile(k) And _ 
           strTempName(j + 1) <> strTempFile(k + 1) And _ 
           k < strTempFile.Length - 2)
        k += 1
    End While
    count += 1
    j += 1
    k += 1

End While

perc = count / (strTempName.Length - 1)

Recommended answer

You can use the Levenshtein distance algorithm. It is a very well-known algorithm that is easy to implement.
The linked page contains Java/C++/VB implementations of the algorithm.
And here (http://blogs.msdn.com/b/toub/archive/2006/05/05/590814.aspx) you can find a generic implementation of this algorithm (this time in C#, but converting it to VB.NET should not be a problem).

I hope this helps. :)
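As a rough illustration of how the Levenshtein distance can be turned into a match percentage, here is a minimal sketch in Python (used for brevity; it is not the code from the linked pages). The `similarity` function and its normalization by the longer string's length are assumptions for illustration; other normalizations are possible.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the row as short as the shorter string
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


print(similarity("four", "for"))  # 0.75: one deletion over four characters
print(similarity("four score and seven years ago",
                 "for scor and sevn yeres ago"))
```

With this normalization an exact match scores 1.0, and "four" against "for" gives the 75% the question expects. Since the distance accounts for character order, a string that merely contains the same letters in a different order does not score as a full hit.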


Maybe this will help as a basis. You will need to modify it.

Points to remember:
1) It compares character by character
2) It skips characters until the next match
3) It waits at the end of a word
4) It jumps to the next word when a new word starts in the first string

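Function Compare(ByVal str1 As String, ByVal str2 As String) As Double
  ' Normalize by the longer string so that extra characters lower the score.
  Dim count As Integer = Math.Max(str1.Length, str2.Length)
  Dim hits As Integer = 0
  Dim i As Integer = 0
  Dim j As Integer = 0
  For i = 0 To str1.Length - 1
    ' New word in str1: skip the space, move j to the start of the next
    ' word in str2, and count the space itself as a hit.
    If str1.Chars(i) = " "c Then
      i += 1
      j = str2.IndexOf(" "c, j) + 1
      hits += 1
      If i >= str1.Length Then Exit For ' str1 ended with a space
    End If
    ' Scan forward in the current word of str2 until str1's character is found.
    While j < str2.Length AndAlso str2.Chars(j) <> " "c
      If str1.Chars(i) = str2.Chars(j) Then
        hits += 1
        j += 1
        Exit While
      Else
        j += 1
      End If
    End While
    ' At the end of str2 or of its current word: step back so j waits there.
    If Not (j < str2.Length AndAlso str2.Chars(j) <> " "c) Then
      j -= 1
    End If
  Next
  Return Math.Round(hits / count, 2)
End Function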





Sample Output:
"four"<->"for" = 0.75
"four stud"<->"for studs" = 0.89

