How to find duplicate phrases in a large string


Problem description

I am trying to figure out an efficient way to find duplicate phrases in a large string. The string will contain hundreds or thousands of words separated by spaces. I've included the code I am currently using below, but it is very inefficient at finding duplicate phrases.

public static string FindDuplicateSubstringFast(string s, string keyword, bool allowOverlap = true)
{
    int matchPos = 0, maxLength = 0;
    // Only search when the keyword occurs somewhere in the string (case-insensitive).
    if (s.ToLower().Contains(keyword.ToLower()))
    {
        // Compare the string against itself shifted by 1, 2, ... characters.
        for (int shift = 1; shift < s.Length; shift++)
        {
            int matchCount = 0;
            for (int i = 0; i < s.Length - shift; i++)
            {
                if (s[i] == s[i + shift])
                {
                    matchCount++;
                    if (matchCount > maxLength)
                    {
                        maxLength = matchCount;
                        matchPos = i - matchCount + 1;
                    }
                    if (!allowOverlap && (matchCount == shift))
                    {
                        // we have found the largest allowable match
                        // for this shift.
                        break;
                    }
                }
                else matchCount = 0;
            }
        }
    }
    // Ignore matches of three characters or fewer.
    if (maxLength > 3) return s.Substring(matchPos, maxLength);
    else return null;
}

I found the example code above in the question "Find duplicate content in string?"

This method is going through every char and I would like to find a way to loop through each word. I'm not sure what would be the best way to do this. I was thinking I could split the string on the empty spaces and then put the words into a list. Iterating through a list should be way more efficient than iterating over every char like I am doing now. However, I don't know how I would iterate through the list and find duplicate phrases.

If anyone could help me figure out an algorithm to iterate through a list to find duplicate phrases, I would be very grateful. I would also be open to any other ideas or methods to find duplicate phrases within a large string.

Please let me know if any more info is needed.

EDIT: Here is an example of a large string (it's small for this example):

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

For example's sake, "Lorem Ipsum" would be the duplicate phrase. I need to return "Lorem Ipsum" and any other phrases that appear in the string more than once.

Solution

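// Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
string lower = BigString.ToLower();
string[] split = lower.Split(' ');
var duplicates = new Dictionary<string, int>();
for (int i = 0; i < split.Length; i++)
{
    string s = split[i];
    for (int j = i + 1; j < split.Length; j++)
    {
        // Extend the phrase starting at word i by one word at a time.
        s += " " + split[j];
        // Escape the phrase so punctuation such as '.' is matched literally.
        int count = Regex.Matches(lower, Regex.Escape(s)).Count;
        // A phrase that occurs only once cannot repeat once it grows longer.
        if (count == 1) break;
        duplicates[s] = count;
    }
}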

Now the dictionary will contain all the repeated phrases together with their "subphrases": for example, a repeated "Lorem Ipsum Dolor" produces entries for both "Lorem Ipsum" and "Lorem Ipsum Dolor". If the subphrases are not interesting to you, loop through the Keys collection of duplicates: if one key is a substring of another key and their values are the same, remove the shorter key.
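The clean-up step described above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the class name PhraseFilter and the method RemoveSubphrases are hypothetical, and the sketch assumes the dictionary maps each phrase to its occurrence count, as in the code above. It also compares keys with a plain character-level Contains, so it does not check word boundaries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhraseFilter
{
    // Hypothetical helper (not from the original answer): returns a copy of
    // `duplicates` without any phrase that is contained in a longer phrase
    // having the same occurrence count.
    public static Dictionary<string, int> RemoveSubphrases(Dictionary<string, int> duplicates)
    {
        var result = new Dictionary<string, int>();
        foreach (var candidate in duplicates)
        {
            // A candidate is "covered" when a strictly longer key with the
            // same count contains it as a substring (character-level check).
            bool covered = duplicates.Any(other =>
                other.Key.Length > candidate.Key.Length &&
                other.Value == candidate.Value &&
                other.Key.Contains(candidate.Key));

            if (!covered)
            {
                result[candidate.Key] = candidate.Value;
            }
        }
        return result;
    }
}
```

Given entries such as "lorem ipsum" (2), "ipsum dolor" (2), "lorem ipsum dolor" (2) and "dummy text" (3), calling PhraseFilter.RemoveSubphrases(duplicates) would keep only "lorem ipsum dolor" and "dummy text". The sketch builds a new dictionary instead of removing keys while enumerating, because modifying a Dictionary during a foreach over it is generally not safe. It compares every key against every other key, which should be fine for a few thousand phrases but may need a smarter approach for much larger inputs.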
