Processing Huge Files In C#


Problem description


I have a 4 GB file that I want to perform a byte-based find and replace on. I have written a simple program to do it, but it takes far too long (90+ minutes) to do just one find and replace. A few hex editors I have tried can perform the task in under 3 minutes without loading the entire target file into memory. Does anyone know a method by which I can accomplish the same thing? Here is my current code:

    public int ReplaceBytes(string File, byte[] Find, byte[] Replace)
    {
        var Stream = new FileStream(File, FileMode.Open, FileAccess.ReadWrite);
        int FindPoint = 0;
        int Results = 0;
        for (long i = 0; i < Stream.Length; i++)
        {
            if (Find[FindPoint] == Stream.ReadByte())
            {
                FindPoint++;
                if (FindPoint > Find.Length - 1)
                {
                    Results++;
                    FindPoint = 0;
                    Stream.Seek(-Find.Length, SeekOrigin.Current);
                    Stream.Write(Replace, 0, Replace.Length);
                }
            }
            else
            {
                FindPoint = 0;
            }
        }
        Stream.Close();
        return Results;
    }

Find and Replace are relatively small compared with the 4 GB "File", by the way. I can easily see why my algorithm is slow, but I am not sure how I could do it better.

Solution

There are lots of better algorithms for finding a substring in a string (which is basically what you are doing).

Start here:

http://en.wikipedia.org/wiki/String_searching_algorithm

The gist of them is that you can skip a lot of bytes by analyzing your substring. Here's a simple example:

4GB File starts with: A B C D E F G H I J K L M N O P

Your substring is: N O P

  1. Skip ahead by the length of the substring minus 1 and check against the substring's last byte, so compare C to P
  2. It doesn't match, so the substring is not the first 3 bytes
  3. Also, C isn't in the substring at all, so you can skip 3 more bytes (the length of the substring)
  4. Compare F to P: it doesn't match and F isn't in the substring, so skip 3
  5. Compare I to P, and so on
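
The skip distances used in the steps above can be precomputed from the substring. The following is a minimal sketch of such a table in the style of the Boyer-Moore-Horspool algorithm (one of the algorithms covered by the linked article); the class and method names are hypothetical and chosen only for illustration.

```csharp
using System;

public static class SkipTableSketch
{
    // For every possible byte value, returns how far the search window may
    // advance when that byte is the one aligned with the pattern's last position.
    public static int[] BuildSkipTable(byte[] pattern)
    {
        var skip = new int[256];

        // A byte that does not occur in the pattern lets us skip the full length.
        for (int i = 0; i < skip.Length; i++)
            skip[i] = pattern.Length;

        // A byte that occurs in the pattern (excluding its last position) only
        // allows a skip that lines it up with its rightmost occurrence.
        for (int i = 0; i < pattern.Length - 1; i++)
            skip[pattern[i]] = pattern.Length - 1 - i;

        return skip;
    }
}
```

For the substring N O P, this gives a skip of 3 for C, F, I and L (not in the substring), 2 for N and 1 for O.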

If you match, go backwards and compare the preceding bytes. If the byte doesn't match but does occur in the substring, then you have to do some more comparing at that point (read the link for details).
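
Putting this together for the file in the question, here is a rough sketch rather than a definitive implementation. It uses a Boyer-Moore-Horspool style search and reads the file in large chunks instead of one byte at a time. Several things are assumptions not stated in the question: `Find` and `Replace` have the same length (an in-place overwrite cannot grow or shrink the file), matches are treated as non-overlapping, and the 1 MiB default buffer size is arbitrary. The class name and the `bufferSize` parameter are hypothetical.

```csharp
using System;
using System.IO;

public static class HorspoolFileReplace
{
    // Skip distances keyed by the byte aligned with the pattern's last position.
    static int[] BuildSkipTable(byte[] pattern)
    {
        var skip = new int[256];
        for (int i = 0; i < skip.Length; i++) skip[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            skip[pattern[i]] = pattern.Length - 1 - i;
        return skip;
    }

    // Index of the first match in data[start..length), or -1 if there is none.
    // Each window is compared from its last byte backwards.
    static int IndexOf(byte[] data, int length, byte[] pattern, int[] skip, int start)
    {
        int pos = start;
        while (pos <= length - pattern.Length)
        {
            int j = pattern.Length - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;
            pos += skip[data[pos + pattern.Length - 1]];
        }
        return -1;
    }

    // Assumption: find and replace have equal length, so the file size never changes.
    public static int ReplaceBytes(string file, byte[] find, byte[] replace, int bufferSize = 1 << 20)
    {
        if (find.Length == 0 || find.Length != replace.Length)
            throw new ArgumentException("find and replace must be non-empty and of equal length.");

        var skip = BuildSkipTable(find);
        var buffer = new byte[Math.Max(bufferSize, find.Length * 2)];
        int results = 0;

        using (var stream = new FileStream(file, FileMode.Open, FileAccess.ReadWrite))
        {
            long bufferStart = 0; // file offset of buffer[0]
            int carried = 0;      // unsearched bytes kept from the previous chunk

            while (true)
            {
                stream.Position = bufferStart + carried;
                int read = stream.Read(buffer, carried, buffer.Length - carried);
                if (read == 0) break; // end of file

                int valid = carried + read;
                int pos = 0, idx;
                while ((idx = IndexOf(buffer, valid, find, skip, pos)) >= 0)
                {
                    // Overwrite the match in place on disk.
                    stream.Position = bufferStart + idx;
                    stream.Write(replace, 0, replace.Length);
                    results++;
                    pos = idx + find.Length;
                }

                // Keep the tail that could begin a match spanning two chunks.
                int keepStart = Math.Max(pos, valid - (find.Length - 1));
                carried = valid - keepStart;
                Array.Copy(buffer, keepStart, buffer, 0, carried);
                bufferStart += keepStart;
            }
        }
        return results;
    }
}
```

Only one buffer is held in memory at a time, so memory use stays flat regardless of file size. If the replacement has a different length, the output would have to be written to a new file instead.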
