Extremely Large Single-Line File Parse
Problem Description
I am downloading data from a site and the site gives the data to me in very large blocks. Within the very large block, there are "chunks" that I need to parse individually. These "chunks" begin with "<ClinicalData>" and end with "</ClinicalData>". Therefore, an example string would look something like:
<ClinicalData><ID="1"></ClinicalData><ClinicalData><ID="2"></ClinicalData><ClinicalData><ID="3"></ClinicalData><ClinicalData><ID="4"></ClinicalData><ClinicalData><ID="5"></ClinicalData>
Under "ideal" circumstances, the block is meant to be a single line of data; however, sometimes there are erroneous newline characters. Since I want to parse the <ClinicalData> chunks within the block, I want to make my data parse-able line by line. Therefore, I read the text file into a StringBuilder, remove newlines (just in case), and then insert my own newlines so that I can read line by line:
StringBuilder dataToWrite = new StringBuilder(File.ReadAllText(filepath), Int32.MaxValue);
// Need to clear newline characters just in case they exist.
dataToWrite.Replace("\n", "");
// Set my own newline characters so the data becomes parse-able by line.
dataToWrite.Replace("<ClinicalData", "\n<ClinicalData");
// Set the data back into a file, which is then used in a StreamReader to parse by lines.
File.WriteAllText(filepath, dataToWrite.ToString());
This was working out great (albeit maybe not efficiently, but at least it was friendly to me :)), until I encountered a chunk of data delivered to me as a 280 MB file.
Now I am getting a System.OutOfMemoryException with this block and I just cannot figure out a way around it. I believe the issue is that StringBuilder cannot handle 280 MB of straight text. I have tried string splits, Regex.Match splits, and various other ways to break it into guaranteed <ClinicalData> chunks, but I keep getting the memory exception. I have also had no luck attempting to read pre-defined chunks (e.g. using .ReadBytes).
Any suggestions on how to handle a 280 MB, potentially-but-might-not-actually-be single line of text would be great!
That's an extremely inefficient way to read a text file, let alone a large one. If you only need a single pass, replacing or adding individual characters, you should use a StreamReader. If you only need one character of lookahead, you only need to maintain a single intermediate state, something like:
enum ReadState
{
    Start,
    SawOpen
}

using (var sr = new StreamReader(@"path\to\clinic.txt"))
using (var sw = new StreamWriter(@"path\to\output.txt"))
{
    var rs = ReadState.Start;
    while (true)
    {
        var r = sr.Read();
        if (r < 0)
        {
            // End of stream: flush a pending '<' before quitting.
            if (rs == ReadState.SawOpen)
                sw.Write('<');
            break;
        }
        char c = (char) r;
        // Strip any stray newline characters.
        if ((c == '\r') || (c == '\n'))
            continue;
        if (rs == ReadState.SawOpen)
        {
            // "<C" starts a ClinicalData tag: put it on its own line.
            if (c == 'C')
                sw.WriteLine();
            sw.Write('<');
            rs = ReadState.Start;
        }
        if (c == '<')
        {
            // Hold the '<' until we see the next character.
            rs = ReadState.SawOpen;
            continue;
        }
        sw.Write(c);
    }
}
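To sanity-check the state machine without touching the file system, the same loop can be run over an in-memory string via StringReader/StringWriter. This is a sketch only; the Normalize name and the sample input are assumptions based on the question's example, not part of the original answer:

```csharp
using System;
using System.IO;

class Program
{
    enum ReadState { Start, SawOpen }

    // Same single-pass, one-character-lookahead transform as above,
    // wrapped in a helper so it can run over an in-memory string.
    static string Normalize(string input)
    {
        using (var sr = new StringReader(input))
        using (var sw = new StringWriter())
        {
            var rs = ReadState.Start;
            while (true)
            {
                var r = sr.Read();
                if (r < 0)
                {
                    // End of input: flush a pending '<'.
                    if (rs == ReadState.SawOpen)
                        sw.Write('<');
                    break;
                }
                char c = (char) r;
                // Drop stray newline characters.
                if ((c == '\r') || (c == '\n'))
                    continue;
                if (rs == ReadState.SawOpen)
                {
                    // "<C" opens a ClinicalData tag: start a fresh line.
                    if (c == 'C')
                        sw.WriteLine();
                    sw.Write('<');
                    rs = ReadState.Start;
                }
                if (c == '<')
                {
                    rs = ReadState.SawOpen;
                    continue;
                }
                sw.Write(c);
            }
            return sw.ToString();
        }
    }

    static void Main()
    {
        string input = "<ClinicalData><ID=\"1\"></ClinicalData>\n<ClinicalData><ID=\"2\"></ClinicalData>";
        // Each <ClinicalData...> element ends up on its own line.
        Console.WriteLine(Normalize(input));
    }
}
```

Note that only the two-character sequence "<C" triggers a line break, so closing tags ("</Clini...") and other elements stay on the same line as the chunk they belong to, and memory stays constant regardless of file size.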