模糊的文字(句子/职称)的匹配在C# [英] Fuzzy text (sentences/titles) matching in C#

查看:128
本文介绍了模糊的文字(句子/职称)的匹配在C#的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

嘿,我使用 Levenshteins 算法得到的源和目标字符串之间的距离。

Hey, I'm using Levenshteins algorithm to get distance between source and target string.

我也有方法,它返回值从0到1:

also I have method which returns value from 0 to 1:

/// <summary>
/// Gets the similarity between two strings.
/// All relation scores are in the [0, 1] range, 
/// which means that if the score gets a maximum value (equal to 1) 
/// then the two string are absolutely similar
/// </summary>
/// <param name="string1">The string1.</param>
/// <param name="string2">The string2.</param>
/// <returns></returns>
public static float CalculateSimilarity(String s1, String s2)
{
    if ((s1 == null) || (s2 == null)) return 0.0f;

    float dis = LevenshteinDistance.Compute(s1, s2);
    float maxLen = s1.Length;
    if (maxLen < s2.Length)
        maxLen = s2.Length;
    if (maxLen == 0.0F)
        return 1.0F;
    else return 1.0F - dis / maxLen;
}

不过这对我来说是不够的。因为我需要更复杂的方式来满足两句话。

but this for me is not enough. Because I need more complex way to match two sentences.

例如我想自动标记一些音乐,我有原创歌曲的名字,和我有歌曲垃圾一样的超,质量,的几年如的 2007年,2008年,的etc..etc ..还有一些文件刚刚 HTTP://trash..thash..song_name_mp3.mp3 ,其他是正常的。我想创建一种算法,能够工作得更加完美比我现在..也许有人能帮助我吗?

For example I want automatically tag some music, I have original song names, and i have songs with trash, like super, quality, years like 2007, 2008, etc..etc.. also some files have just http://trash..thash..song_name_mp3.mp3, other are normal. I want to create an algorithm which will work just more perfect than mine now.. Maybe anyone can help me?

下面是我目前的算法中:

here is my current algo:

/// <summary>
/// if we need to ignore this target.
/// </summary>
/// <param name="targetString">The target string.</param>
/// <returns></returns>
private bool doIgnore(String targetString)
{
    if ((targetString != null) && (targetString != String.Empty))
    {
        for (int i = 0; i < ignoreWordsList.Length; ++i)
        {
            //* if we found ignore word or target string matching some some special cases like years (Regex).
            if (targetString == ignoreWordsList[i] || (isMatchInSpecialCases(targetString))) return true;
        }
    }

   return false;
}

/// <summary>
/// Removes the duplicates.
/// </summary>
/// <param name="list">The list.</param>
private void removeDuplicates(List<String> list)
{
    if ((list != null) && (list.Count > 0))
    {
        for (int i = 0; i < list.Count - 1; ++i)
        {
            if (list[i] == list[i + 1])
            {
                list.RemoveAt(i);
                --i;
            }
        }
    }
}

/// <summary>
/// Does the fuzzy match.
/// </summary>
/// <param name="targetTitle">The target title.</param>
/// <returns></returns>
private TitleMatchResult doFuzzyMatch(String targetTitle)
{
    TitleMatchResult matchResult = null;

   if (targetTitle != null && targetTitle != String.Empty)
   {
       try
       {
           //* change target title (string) to lower case.
           targetTitle = targetTitle.ToLower();

           //* scores, we will select higher score at the end.
           Dictionary<Title, float> scores = new Dictionary<Title, float>();

           //* do split special chars: '-', ' ', '.', ',', '?', '/', ':', ';', '%', '(', ')', '#', '\"', '\'', '!', '|', '^', '*', '[', ']', '{', '}', '=', '!', '+', '_'
           List<String> targetKeywords = new List<string>(targetTitle.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));

          //* remove all trash from keywords, like super, quality, etc..
           targetKeywords.RemoveAll(delegate(String x) { return doIgnore(x); });
          //* sort keywords.
          targetKeywords.Sort();
        //* remove some duplicates.
        removeDuplicates(targetKeywords);

        //* go through all original titles.
        foreach (Title sourceTitle in titles)
        {
            float tempScore = 0f;
            //* split orig. title to keywords list.
            List<String> sourceKeywords = new List<string>(sourceTitle.Name.Split(ignoreCharsList, StringSplitOptions.RemoveEmptyEntries));
            sourceKeywords.Sort();
            removeDuplicates(sourceKeywords);

            //* go through all source ttl keywords.
            foreach (String keyw1 in sourceKeywords)
            {
                float max = float.MinValue;
                foreach (String keyw2 in targetKeywords)
                {
                    float currentScore = StringMatching.StringMatching.CalculateSimilarity(keyw1.ToLower(), keyw2);
                    if (currentScore > max)
                    {
                        max = currentScore;
                    }
                }
                tempScore += max;
            }

            //* calculate average score.
            float averageScore = (tempScore / Math.Max(targetKeywords.Count, sourceKeywords.Count)); 

            //* if average score is bigger than minimal score and target title is not in this source title ignore list.
            if (averageScore >= minimalScore && !sourceTitle.doIgnore(targetTitle))
            {
                //* add score.
                scores.Add(sourceTitle, averageScore);
            }
        }

        //* choose biggest score.
        float maxi = float.MinValue;
        foreach (KeyValuePair<Title, float> kvp in scores)
        {
            if (kvp.Value > maxi)
            {
                maxi = kvp.Value;
                matchResult = new TitleMatchResult(maxi, kvp.Key, MatchTechnique.FuzzyLogic);
            }
        }
    }
    catch { }
}
//* return result.
return matchResult;
}

这工作正常,但只是在某些情况下,很多职称应匹配,不匹配的...我想我需要某种配方配重块等游戏,但我不能想到一个。

This works normally but just in some cases, a lot of titles which should match, does not match... I think I need some kind of formula to play with weights and etc, but i can't think of one..

想法?建议?交易算法?

Ideas? Suggestions? Algos?

顺便说一句,我已经知道这个话题(我的同事已经公布,但我们不能跟此问题的妥善解决。): <一href="http://stackoverflow.com/questions/49263/approximate-string-matching-algorithms">http://stackoverflow.com/questions/49263/approximate-string-matching-algorithms

by the way I already know this topic (My colleague already posted it but we cannot come with a proper solution for this problem.): http://stackoverflow.com/questions/49263/approximate-string-matching-algorithms

推荐答案

在这里您的问题可能干扰词和有用的数据区分:

Your problem here may be distinguishing between noise words and useful data:

  • Rolling_Stones.Best_of_2003.Wild_Horses.mp3
  • Super.Quality.Wild_Horses.mp3
  • Tori_Amos.Wild_Horses.mp3

您可能需要产生的噪音字的字典,不容忽视。这似乎是笨重的,但我不知道有一个算法,可以带/专辑名称和噪音区分开来。

You may need to produce a dictionary of noise words to ignore. That seems clunky, but I'm not sure there's an algorithm that can distinguish between band/album names and noise.

这篇关于模糊的文字(句子/职称)的匹配在C#的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆