Extremely Large Single-Line File Parse

Problem Description

I am downloading data from a site and the site gives the data to me in very large blocks. Within the very large block, there are "chunks" that I need to parse individually. These "chunks" begin with "<ClinicalData>" and end with "</ClinicalData>". Therefore, an example string would look something like:

<ClinicalData><ID="1"></ClinicalData><ClinicalData><ID="2"></ClinicalData><ClinicalData><ID="3"></ClinicalData><ClinicalData><ID="4"></ClinicalData><ClinicalData><ID="5"></ClinicalData>

Under "ideal" circumstances, the block is meant to be one-single line of data, however sometimes there are erroneous newline characters. Since I want to parse the (ClinicalData) chunks within the block, I want to make my data parse-able line-by-line. Therefore, I take the text file, read it all into a StringBuilder, remove new-lines (just in case), and then insert my own newlines, that way I can read line-by-line.

StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);

// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");

// set my own newline characters so the data becomes parse-able by line 
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");

// set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());

This has been working out great (albeit maybe not efficient, but at least it is friendly to me :)), until now, when I encountered a chunk of data that is being given to me as a 280 MB file.

Now I am getting a System.OutOfMemoryException with this block and I just cannot figure out a way around it. I believe the issue is that StringBuilder cannot handle 280 MB of straight text? Well, I have tried string splits, Regex.Match splits, and various other ways to break it into guaranteed <ClinicalData> chunks, but I continue to get the memory exception. I have also had no luck in attempting to read pre-defined chunks (e.g. using .ReadBytes).

Any suggestions on how to handle a 280MB large, potentially-but-might-not-actually-be single line of text would be great!

Solution

That's an extremely inefficient way to read a text file, let alone a large one. If you only need one pass, replacing or adding individual characters, you should use a StreamReader. If you only need one character of lookahead, you only need to maintain a single intermediate state, something like:

enum ReadState
{
    Start,    // normal copying of characters
    SawOpen   // the previous character was '<' and has not been written yet
}

using (var sr = new StreamReader(@"path\to\clinic.txt"))
using (var sw = new StreamWriter(@"path\to\output.txt"))
{
    var rs = ReadState.Start;
    while (true)
    {
        var r = sr.Read();
        if (r < 0)
        {
            // End of input: flush a '<' that was still being held back.
            if (rs == ReadState.SawOpen)
                sw.Write('<');
            break;
        }

        char c = (char) r;

        // Drop any stray newline characters in the input.
        if ((c == '\r') || (c == '\n'))
            continue;

        if (rs == ReadState.SawOpen)
        {
            // A '<' was held back; if the next character is 'C' (the start of
            // <ClinicalData>), begin a new output line before writing it.
            if (c == 'C')
                sw.WriteLine();

            sw.Write('<');
            rs = ReadState.Start;
        }

        if (c == '<')
        {
            // Hold the '<' until the next character tells us whether it
            // opens a ClinicalData element.
            rs = ReadState.SawOpen;
            continue;
        }

        sw.Write(c);
    }
}
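
Once the chunks have been split onto their own lines, the rewritten file can be streamed back in with an ordinary StreamReader, one <ClinicalData> element per ReadLine call, without ever holding the full 280 MB in memory. A minimal sketch of that follow-up step, assuming the output.txt path from above and a purely hypothetical per-chunk action (neither appears in the original answer):

using System;
using System.IO;

class ChunkConsumer
{
    static void Main()
    {
        // Each non-empty line of output.txt now holds one <ClinicalData>...</ClinicalData> chunk.
        using (var reader = new StreamReader(@"path\to\output.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // The very first line may be empty, because a newline is written
                // before the first <ClinicalData>; skip such lines.
                if (line.Length == 0)
                    continue;

                // Hypothetical per-chunk handling: here we only report the chunk's length.
                Console.WriteLine("Parsed chunk of {0} characters", line.Length);
            }
        }
    }
}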
