Textscan on file with large number of lines


Problem description

I'm trying to analyze a very large file using textscan in MATLAB. The file in question is about 12 GB in size and contains about 250 million lines, each with seven floating-point numbers delimited by a space. Because this obviously does not fit into the RAM of my desktop, I'm using the approach suggested in the MATLAB documentation (i.e. loading and analyzing a smaller block of the file at a time; according to the documentation this should allow for processing "arbitrarily large delimited text file[s]"). This only allows me to scan about 43% of the file, after which textscan starts returning empty cells (despite there still being data left to scan in the file).

To debug, I attempted to go to several positions in the file using the fseek function, for example like this:

fileInfo = dir(fileName);
fid = fopen(fileName);
fseek(fid, floor(fileInfo.bytes/10), 'bof');
textscan(fid,'%f %f %f %f %f %f %f','Delimiter',' ');

I'm assuming that the way I'm using fseek here moves the position indicator to about 10% of my file. (I'm aware this doesn't necessarily mean the indicator is at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I substitute fileInfo.bytes/10 by fileInfo.bytes/2 (i.e. moving it to about 50% of the file) everything breaks down and textscan only returns an empty 1x7 cell.
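The seek-then-scan experiment above can be made a little more robust. The following is only a sketch under stated assumptions: the demo file, its size, and the variable names status and block are hypothetical and not from the original post. It checks the return value of fseek (0 on success, -1 on failure) and calls fgetl once to discard the line the pointer probably landed in the middle of, so that textscan starts on a line boundary.

```matlab
% Hypothetical demo file: 1000 lines of seven space-delimited numbers.
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%g %g %g %g %g %g %g\n', rand(7, 1000));
fclose(fid);

fileInfo = dir(fileName);
fid = fopen(fileName, 'r');

% Jump to roughly the middle of the file and verify that the seek worked.
status = fseek(fid, floor(fileInfo.bytes/2), 'bof');
if status ~= 0
    msg = ferror(fid);
    fclose(fid);
    error('fseek failed: %s', msg);
end

fgetl(fid);   % discard the rest of the (probably partial) current line

% Read the next 5 complete lines from the new position.
C = textscan(fid, '%f %f %f %f %f %f %f', 5, 'Delimiter', ' ');
fclose(fid);

block = [C{:}];   % 5-by-7 numeric matrix
```

If status is non-zero for large offsets, the seek itself is failing, which would be worth ruling out before suspecting textscan.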

I looked at the file using a text editor for large files, and this shows that the entire file looks fine, and that there should be no reason for textscan to be confused. The only possible explanation that I can think of is that something goes wrong on a much deeper level that I have little understanding of. Any suggestions would be greatly appreciated!

EDIT

The relevant part of my code used to look like this:

while ~feof(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter', ' '); %// Read nLines
        %// do some stuff
end

First I tried fixing it using ftell and fseek as suggested by Hoki below. This gave exactly the same error as I got before: MATLAB was unable to read in more than approximately 43% of the file. Then I tried using the HeaderLines solution (also suggested below), like this:

i = 0;
while ~feof(fid)
    frewind(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter',' ', 'HeaderLines', i*nLines);
        %// do some stuff
    i = i + 1;
end

This seems to read in the data without producing errors; it is, however, incredibly slow.

I'm not entirely sure I understand what HeaderLines does in this context, but it seems to make textscan completely ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but to no avail because of some reason I don't understand yet.

Solution

Moving the file pointer with fseek is only useful when you know precisely where (or by how many bytes) you want to move it. It is very useful for binary files when you just want to skip some records of known length. On a text file, however, it is more dangerous and confusing than helpful (unless you are absolutely sure that every line is the same size and every element on a line sits at exactly the same position/column, which doesn't happen often).
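To illustrate the binary case, here is a minimal sketch, assuming a hypothetical file of fixed-size records with seven doubles each (the file, recordBytes and k are made up for the example). Because every record occupies the same number of bytes, the offset of record k can be computed directly.

```matlab
% Hypothetical binary file: 100 records, each holding 7 doubles (56 bytes).
fileName = [tempname '.bin'];
data = reshape(1:700, 7, 100);      % column k is record k
fid = fopen(fileName, 'w');
fwrite(fid, data, 'double');        % written column by column
fclose(fid);

recordBytes = 7 * 8;                % 7 doubles of 8 bytes each
k = 42;                             % record to read (1-based)

fid = fopen(fileName, 'r');
fseek(fid, (k-1) * recordBytes, 'bof');   % jump straight to record k
record = fread(fid, 7, 'double');         % 7-by-1 column vector
fclose(fid);
```

No such arithmetic is possible for a text file whose lines vary in length, which is why the methods below work in units of lines instead.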

There are several ways to read a text file block by block:

1) Use the HeaderLines option

To simply skip a block of lines on a text file, you can use the HeaderLines parameter of textscan, so for example:

readFormat = '%f %f %f %f %f %f %f' ;   %// read format specifier
nLines = 10000 ;                        %// number of lines to read per block

fileInfo = dir(fileName);

%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' '); %// read the first 10000 lines
fclose(fid);
    %// Now do something with your "M" data

Then when you want to read the second block:

%// later read the SECOND block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', nLines); %// read lines 10001 to 20000
fclose(fid);

And if you have many blocks, for the Nth block, just adapt:

%// and then for the Nth block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', (N-1)*nLines);
fclose(fid);

If necessary (if you have many blocks), just code this last version in a loop.
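Such a loop could look like the sketch below. It assumes a small hypothetical demo file and uses a row counter (totalRows) as a stand-in for the real processing; the stopping rule (a block shorter than nLines means the end of the file was reached) is also an assumption of this example. Note that textscan presumably still has to read past the skipped lines on every pass, so each block takes longer than the previous one, which would match the slowness reported in the question.

```matlab
% Hypothetical demo file: 25 lines of seven space-delimited numbers.
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%d %d %d %d %d %d %d\n', reshape(1:175, 7, 25));
fclose(fid);

readFormat = '%f %f %f %f %f %f %f';
nLines = 10;          % lines per block
totalRows = 0;        % stand-in for the real per-block processing
N = 1;                % index of the block being read
while true
    % Reopen the file so HeaderLines always counts from the first line.
    fid = fopen(fileName, 'r');
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ', ...
                 'HeaderLines', (N-1)*nLines);
    fclose(fid);

    block = [M{:}];                          % rows-by-7 numeric matrix
    totalRows = totalRows + size(block, 1);  % "do something" with the block

    if size(block, 1) < nLines
        break;        % short (or empty) block: no more lines to read
    end
    N = N + 1;
end
```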

Note that this works as long as you close the file after reading each block (so the file pointer starts at the beginning of the file when you open it again). Closing the file after reading a block of data is safer if your processing might take a long time or may error out (you don't want files that remain open too long, or to lose the fid if you crash).


2) Read by block (without closing the file)

If the processing of the block is quick and safe enough that you're sure it won't bomb out, you can afford not to close the file. In this case, the file pointer stays where textscan stopped, so you could also:

  • Read a block (do not close the file): M = textscan(fid, readFormat, nLines)
  • Process it, then save your result (and release memory)
  • Read the next block with the same call: M = textscan(fid, readFormat, nLines)

In this case you don't need the HeaderLines parameter because textscan resumes reading exactly where it stopped.
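Put together, the steps above might be sketched as follows. The demo file and the running column sum (colSums) are hypothetical placeholders for real data and real processing. The loop stops on a short or empty block rather than testing feof, an assumption made here so that it also terminates if textscan returns empty cells before the end of the file, as described in the question.

```matlab
% Hypothetical demo file: 25 lines of seven space-delimited numbers.
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%d %d %d %d %d %d %d\n', reshape(1:175, 7, 25));
fclose(fid);

readFormat = '%f %f %f %f %f %f %f';
nLines = 10;                 % lines per block
colSums = zeros(1, 7);       % running result kept across blocks

fid = fopen(fileName, 'r');
while true
    % Each call continues from where the previous call stopped.
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ');
    block = [M{:}];                      % rows-by-7 numeric matrix

    colSums = colSums + sum(block, 1);   % process the block, keep only the result

    if size(block, 1) < nLines
        break;               % short (or empty) block: stop reading
    end
end
fclose(fid);
```

Only one block is held in memory at a time, and the file is read once from start to finish.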


3) Use ftell and fseek

Lastly, you could use fseek to start reading the file at the precise position you want, but in this case I recommend using it in conjunction with ftell.

ftell returns the current position in an open file, so use it to record where you stopped reading, then use fseek the next time to go straight to that position. Something like:

%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;   %// remember where this block ended
fclose(fid);

%// do some stuff

%// then read another block:
fid = fopen(fileName);
fseek( fid , lastPosition , 'bof' ) ;   %// fseek(fid, offset, origin)
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid);
%// and so on ...
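
The same pattern written as a loop might look like the sketch below. The demo file, the row counter (totalRows) and the stopping rule (a block shorter than nLines) are assumptions made for this example. It also checks the value returned by fseek, which is 0 on success and -1 on failure, so a failed seek is reported instead of silently producing an empty read.

```matlab
% Hypothetical demo file: 25 lines of seven space-delimited numbers.
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%d %d %d %d %d %d %d\n', reshape(1:175, 7, 25));
fclose(fid);

readFormat = '%f %f %f %f %f %f %f';
nLines = 10;            % lines per block
lastPosition = 0;       % byte offset at which the next block starts
totalRows = 0;          % stand-in for the real per-block processing
while true
    fid = fopen(fileName, 'r');
    if fseek(fid, lastPosition, 'bof') ~= 0
        msg = ferror(fid);
        fclose(fid);
        error('fseek to byte %d failed: %s', lastPosition, msg);
    end
    M = textscan(fid, readFormat, nLines, 'Delimiter', ' ');
    lastPosition = ftell(fid);   % remember where this block ended
    fclose(fid);

    block = [M{:}];                          % rows-by-7 numeric matrix
    totalRows = totalRows + size(block, 1);  % "do some stuff" with the block

    if size(block, 1) < nLines
        break;          % short (or empty) block: end of file reached
    end
end
```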
