查找字符串中的重复的内容? [英] Find duplicate content in string?

查看:91
本文介绍了查找字符串中的重复的内容?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

你将如何解决以下问题:

How would you solve the following problem:

我有一个文本(约10页)半大文件,我想找到这个文字重复的内容。更具体地,给定的一个字符串,找出两个最长字符串相同。

I have a semi-large file with text (about 10 pages) and I want to find duplicate content in this text. To be more specific, given a string, find the two longest strings that are identical.

我一直在寻找的最长公共子:

I've been looking at the Longest common subsequence:

http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Longest_common_subsequence

但这些实现两个字符串作为输入。

But these implementations take two strings as input.

也许有一个服务已经这样做了?

Maybe there's a service already doing this?

推荐答案

下面是一个简单的(但效率不高)算法:循环中的所有可能的子串的长度,从最高到1。对于每个长度,把该长度的所有子串成一本字典。如果你发现一个重复的,停下来。它必须是规模最大的一次。下面是相应的C#code:

Here is a simple (but inefficient) algorithm: Loop all possible substring lengths, from the maximum down to 1. For each length, put all substrings of that length into a dictionary. If you find a duplicate, stop. It must be the largest one. Here is the corresponding C# code:

    public static string FindDuplicateSubstring(string s)
    {
        for (int len = s.Length-1; len > 0; len--) {
            var dict = new Dictionary<string, int>();
            for (int i = 0; i <= s.Length - len; i++) {
                string sub = s.Substring(i, len);
                if (dict.ContainsKey(sub)) return sub;
                else dict[sub] = i;
            }
        }
        return null;
    }

例如,当适用于您的问题的文本,最长重复子串是执行。需要注意的是重叠的子字符串是允许的,即输入BBBB收益BBB。这是不是从你的问题清楚,如果你想排除重叠的情况。对于一个更快的方法,请参阅我的其他的答案。

For example, when applied to the text of your question, the longest repeated substring is "implementation". Note that overlapping substrings are allowed, i.e. the input "bbbb" returns "bbb". It wasn't clear from your question if you wanted to exclude the overlapping case. For a faster approach, see my other answer.

这篇关于查找字符串中的重复的内容?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆