Textscan on file with large number of lines
Problem description

I am trying to analyze a very large file using textscan in MATLAB. The file is about 12 GB in size and contains about 250 million lines, each with seven floating-point numbers delimited by a space. Because this obviously does not fit into the RAM of my desktop, I am using the approach suggested in the MATLAB documentation, i.e. loading and analyzing a smaller block of the file at a time. According to the documentation, this should allow for processing "arbitrarily large delimited text file[s]". This only allows me to scan about 43% of the file, after which textscan starts returning empty cells (despite there still being data left to scan in the file).

To debug, I attempted to go to several positions in the file using the fseek
function, for example like this:

fileInfo = dir(fileName);
fid = fopen(fileName);
fseek(fid, floor(fileInfo.bytes/10), 'bof');
textscan(fid,'%f %f %f %f %f %f %f','Delimiter',' ');

fseek here moves the position indicator to about 10% of the way through my file. (I'm aware this doesn't necessarily mean the indicator is at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I replace fileInfo.bytes/10 with fileInfo.bytes/2 (i.e. moving it to about 50% of the file), everything breaks down and textscan only returns an empty 1x7 cell.

I looked at the file using a text editor for large files, and this shows that the entire file looks fine, and that there should be no reason for textscan to be confused. The only possible explanation that I can think of is that something goes wrong on a much deeper level that I have little understanding of. Any suggestions would be greatly appreciated!

EDIT

The relevant part of my code used to look like this:

while ~feof(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter', ' '); %// Read nLines
    %// do some stuff
end

First I tried fixing it using ftell and fseek as suggested by Hoki below. This gave exactly the same error as I got before: MATLAB was unable to read in more than approximately 43% of the file. Then I tried using the HeaderLines solution (also suggested below), like this:

i = 0;
while ~feof(fid)
    frewind(fid)
    data = textscan(fid, FormatString, nLines, 'Delimiter',' ', 'HeaderLines', i*nLines);
    %// do some stuff
    i = i + 1;
end

This seems to read in the data without producing errors; it is, however, incredibly slow. I'm not entirely sure I understand what HeaderLines does in this context, but it seems to make textscan completely ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but to no avail, for some reason I don't understand yet.

Moving a file pointer with fseek is only good when you know precisely where (or by how many bytes) you want to move the cursor. It is very useful for binary files when you just want to skip some records of known length. But on a text file it is more dangerous and confusing than anything else (unless you are absolutely sure that each line is the same size and each element on the line is at exactly the same place/column, but that doesn't happen often).

There are several ways to read a text file block by block:

1) Use the HeaderLines option

To simply skip a block of lines in a text file, you can use the HeaderLines parameter of textscan, so for example:

readFormat = '%f %f %f %f %f %f %f' ; %// read format specifier
nLines = 10000 ; %// number of lines to read per block
fileInfo = dir(fileName);
%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' '); %// read the first 10000 lines
fclose(fid)
%// Now do something with your "M" data

Then when you want to read the second block:

%// later read the SECOND block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', nLines); %// read lines 10001 to 20000
fclose(fid)

And if you have many blocks, for the Nth block, just adapt:

%// and then for the Nth block:
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ','HeaderLines', (N-1)*nLines);
fclose(fid)
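
If necessary (if you have many blocks), just code this last version in a loop.

Note that this is good if you close your file after each block reading (so the file pointer will start at the beginning of the file when you open it again). Closing the file after reading a block of data is safer if your processing might take a long time or may error out (you don't want to have files which remain open too long, or to lose the fid if you crash).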
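
The "code this last version in a loop" suggestion could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample file, the tiny block size and the variable names totalRows and block are hypothetical and not from the original posts. The loop stops when textscan returns an empty first column rather than relying on feof.

```matlab
% Build a small sample file so the sketch is self-contained (hypothetical data).
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%g %g %g %g %g %g %g\n', rand(7, 25));   % 25 lines of 7 numbers
fclose(fid);

readFormat = '%f %f %f %f %f %f %f';
nLines     = 10;     % lines per block (tiny here; e.g. 10000 on real data)
N          = 0;      % number of blocks already read
totalRows  = 0;      % placeholder result of the "processing"

while true
    fid = fopen(fileName, 'r');
    % Skip the N blocks already processed, then read the next one.
    M = textscan(fid, readFormat, nLines, 'HeaderLines', N*nLines);
    fclose(fid);                 % the file is never left open while processing

    if isempty(M{1})             % nothing left to read
        break
    end

    block = [M{:}];              % numeric matrix, one row per line read
    totalRows = totalRows + size(block, 1);
    N = N + 1;
end

delete(fileName);                % remove the temporary sample file
```

Because every pass has to read through all the skipped lines again before reaching its block, the total work grows with the square of the number of blocks, which is consistent with the slowness reported in the question.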
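
2) Read by block (without closing the file)

If the processing of the block is quick and safe enough that you're sure it won't bomb out, you can afford not to close the file. In this case, the textscan file pointer will stay where you stopped, so you could also:

M = textscan(fid, readFormat, nLines) %// read the first block
M = textscan(fid, readFormat, nLines) %// read the next block

In this case you wouldn't need the HeaderLines parameter because textscan will resume reading exactly where it stopped.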
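
A minimal sketch of this second approach as a loop, assuming a small hypothetical sample file. The names nBlocks, totalRows and closeFile are illustrative, and using onCleanup to guard against a leaked fid is a choice made here, not something the answer prescribes.

```matlab
% Build a small sample file so the sketch is self-contained (hypothetical data).
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%g %g %g %g %g %g %g\n', rand(7, 25));   % 25 lines of 7 numbers
fclose(fid);

readFormat = '%f %f %f %f %f %f %f';
nLines     = 10;     % lines per block
nBlocks    = 0;
totalRows  = 0;

fid = fopen(fileName, 'r');
closeFile = onCleanup(@() fclose(fid));   % closes the file when this variable is cleared or goes out of scope

while ~feof(fid)
    M = textscan(fid, readFormat, nLines);   % resumes where the last call stopped
    if isempty(M{1})
        break            % no progress was made: stop instead of looping forever
    end
    totalRows = totalRows + numel(M{1});     % placeholder for real processing
    nBlocks   = nBlocks + 1;
end

clear closeFile          % triggers fclose(fid)
delete(fileName);        % remove the temporary sample file
```

The isempty check matters for the situation described in the question: if textscan starts returning empty cells while feof is still false, a loop that tests only feof would never terminate.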

3) Use ftell and fseek

Lastly, you could use fseek to start reading the file at the precise position you want, but in this case I recommend using it in conjunction with ftell. ftell will return the current position in an open file, so use it to record the position at which you last stopped reading, then use fseek the next time to go straight to this position. Something like:

%// read FIRST block
fid = fopen(fileName);
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid)
%// do some stuff
%// then read another block:
fid = fopen(fileName);
fseek( fid , lastPosition , 'bof' ) ; %// offset first, then origin
M = textscan(fid, readFormat, nLines,'Delimiter',' ');
lastPosition = ftell(fid) ;
fclose(fid)
%// and so on ...
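
The same idea written as a loop, with the return status of fseek checked so that a failed seek is reported instead of silently producing empty reads. This is a sketch under assumptions: the sample file and the names totalRows, status and msg are hypothetical, and it does not claim to explain the 43% limit described in the question.

```matlab
% Build a small sample file so the sketch is self-contained (hypothetical data).
fileName = [tempname '.txt'];
fid = fopen(fileName, 'w');
fprintf(fid, '%g %g %g %g %g %g %g\n', rand(7, 25));   % 25 lines of 7 numbers
fclose(fid);

readFormat   = '%f %f %f %f %f %f %f';
nLines       = 10;   % lines per block
lastPosition = 0;    % byte offset at which the next block starts
totalRows    = 0;

while true
    fid = fopen(fileName, 'r');
    status = fseek(fid, lastPosition, 'bof');   % returns 0 on success, -1 on failure
    if status ~= 0
        msg = ferror(fid);
        fclose(fid);
        error('fseek failed at offset %d: %s', lastPosition, msg);
    end

    M = textscan(fid, readFormat, nLines);
    lastPosition = ftell(fid);                  % remember where this block ended
    fclose(fid);

    if isempty(M{1})                            % nothing left to read
        break
    end
    totalRows = totalRows + numel(M{1});        % placeholder for real processing
end

delete(fileName);    % remove the temporary sample file
```

Since the offset always comes from ftell after a complete block, each seek lands at the start of a line, which avoids the mid-line landing problem of seeking to an arbitrary fraction of the file size.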