Fastest Matlab file reading?

Question

My MATLAB program is reading a file about 7 million lines long and wasting far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is deathly slow; what MATLAB function should I be using instead?

Catch: I have to operate on each line one at a time without storing the whole file in memory, so none of the commands that read entire matrices are on the table.

fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);    
    %do stuff with nums
    tline = fgetl(fid);
end
fclose(fid);

Recommended answer

Problem statement

This is a common struggle, and nothing settles it like a test. Here are my assumptions:

  1. A well-formatted ASCII file containing two columns of numbers. No headers, no inconsistent lines, etc.

  2. The method must scale to files that are too large to fit in memory (although my patience is limited, so my test file is only 500,000 lines).

  3. The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time and cannot be vectorized.

Discussion

With that in mind, the answers and comments seem to be encouraging efficiency in three areas:

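  • Reading the file in larger batches
  • Performing the string-to-number conversion more efficiently (either by batching or by using better functions)
  • Making the actual processing more efficient (which I have ruled out via rule 3 above).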

I put together a quick script to test the ingestion speed (and consistency of result) of 8 variations on these themes. The results are:

  • Initial code. 68.23 sec. 582582 check
  • Using sscanf, once per line. 27.20 sec. 582582 check
  • Using fscanf in large batches. 8.93 sec. 582582 check
  • Using textscan in large batches. 8.79 sec. 582582 check
  • Reading large batches into memory, then sscanf. 8.15 sec. 582582 check
  • Using java single line file reader and sscanf on single lines. 63.56 sec. 582582 check
  • Using java single item token scanner. 81.19 sec. 582582 check
  • Fully batched operations (non-compliant). 1.02 sec. 508680 check (violates rule 3)

More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, which can be removed by switching to sscanf.
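
As a minimal sketch of that swap, the snippet below parses one line both ways. The sample line and the variable names are made up for illustration; the format string is the one used in the benchmark code further down.

```matlab
% Sketch: parse one line holding two integers of unknown width.
tline = '123456, 78';                 % hypothetical sample line, same layout as the test file

% str2num evaluates the text (it uses eval internally), which is flexible but slow.
numsSlow = str2num(tline);            % 1x2 row vector

% sscanf applies a fixed format; %d consumes however many digits are present.
numsFast = sscanf(tline, '%d, %d');   % 2x1 column vector

% Both hold the same values; only the orientation differs.
sameValues = isequal(numsSlow(:), numsFast);
```

Because `%d` reads digits until it hits a non-digit, the unknown field width in the question is not a problem for sscanf.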

About another 2/3 of the remaining time (27 -> 8 sec) can be saved by using larger batches for both file reading and string-to-number conversion.
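
A small, self-contained sketch of the batching idea follows. It is a variation on the benchmark code, not a copy of it: it passes a `[2 N]` size to fscanf instead of reshaping afterwards. The scratch file name, the tiny batch size, and the running `total` are placeholders chosen for illustration.

```matlab
% Sketch: read a two-column file in fixed-size batches with fscanf.
fName = fullfile(tempdir, 'batch_demo.txt');    % hypothetical scratch file
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d\n', [1 2; 3 4; 5 6; 7 8; 9 10]');  % five lines: "1, 2" ... "9, 10"
fclose(fid);

rowsPerBatch = 2;       % tiny on purpose; use thousands of rows in practice
rowsSeen = 0;
total = 0;
fid = fopen(fName, 'r');
batch = fscanf(fid, '%d, %d', [2 rowsPerBatch])';   % up to rowsPerBatch-by-2
while ~isempty(batch)
    for ix = 1:size(batch, 1)
        nums = batch(ix, :);         % one line's worth of data
        total = total + sum(nums);   % stand-in for the per-row work
        rowsSeen = rowsSeen + 1;
    end
    batch = fscanf(fid, '%d, %d', [2 rowsPerBatch])';
end
fclose(fid);
delete(fName);
```

The per-row loop is unchanged from the line-by-line version; only the file access and the text conversion are amortized over many rows.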

If we are willing to violate rule number three in the original post, another 7/8 of the time can be saved by switching to fully numeric (vectorized) processing. However, some algorithms do not lend themselves to this, so we leave it alone. (Note that the "check" value does not match for the last entry.)
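
To see why the check value differs, here is a toy illustration with made-up data: applying the benchmark's running recurrence once per row is not the same computation as applying it once per batch.

```matlab
% Sketch: the running CHECK depends on how the rows are grouped.
rows = [10 20; 30 50; 70 90];    % hypothetical data, three "lines"

% Row-at-a-time recurrence, as in the compliant variants.
checkPerRow = 0;
for ix = 1:size(rows, 1)
    checkPerRow = round((checkPerRow + mean(rows(ix, :))) / 2);
end

% Same recurrence applied once to the whole batch, as in the vectorized variant.
checkBatched = round((0 + mean(rows(:))) / 2);

% The two results disagree, so the mismatch is expected rather than a parsing bug.
differs = (checkPerRow ~= checkBatched);
```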

Finally, in direct contradiction to a previous edit of mine within this response, no savings are available from switching to the cached Java single-line readers. In fact, that solution is 2 to 3 times slower than the comparable single-line result using native readers (63 vs. 27 seconds).

Sample code for all of the solutions described above is included below.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
    d = randi(1e6, 1e5,2);
    fprintf(fid, '%d, %d \n',d);
end
fclose(fid);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
    for ix = 1:size(scannedData{1},1)
        nums = [scannedData{1}(ix) scannedData{2}(ix)];
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches.  %3.2f sec.  %d check \n', t, CHECK);



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader =  java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines.  %3.2f sec.  %d check \n', t, CHECK);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s,\n\r]+');
while scanner.hasNextInt()
    nums = [scanner.nextInt() scanner.nextInt()];
    CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    CHECK = round((CHECK + mean(scannedData(:)) ) /2);

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations.  %3.2f sec.  %d check \n', t, CHECK);


(original answer)

To expand on the point made by Ben ... your bottleneck will always be file I/O if you are reading these files line by line.

I understand that sometimes you cannot fit a whole file into memory. I typically read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system). Then I either read additional single characters or back off single characters to get a whole number of lines, and then run the string parsing (e.g. sscanf).

Then if you want you can process the resulting large matrix one row at a time, before repeating the process until you read the end of the file.
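
The benchmark code above uses the "read additional single characters" variant. A rough sketch of the "back off" variant might look like the following. It assumes every line is shorter than the block size; the scratch file, the tiny `blockSize`, and the running `total` are placeholders for illustration only.

```matlab
% Sketch of the "back off" variant: read a block, trim to the last newline, rewind.
fName = fullfile(tempdir, 'backoff_demo.txt');  % hypothetical scratch file
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d\n', [11 22; 333 4; 5 66666; 7 8]');
fclose(fid);

blockSize = 12;         % tiny on purpose; 1e5 or 1e6 characters in practice
rowsSeen = 0;
total = 0;
fid = fopen(fName, 'r');
block = fread(fid, blockSize, 'uint8=>char')';
while ~isempty(block)
    lastEol = find(block == sprintf('\n'), 1, 'last');
    if ~isempty(lastEol) && lastEol < numel(block)
        % Rewind so the partial trailing line is re-read with the next block.
        fseek(fid, lastEol - numel(block), 'cof');
        block = block(1:lastEol);
    end
    scanned = reshape(sscanf(block, '%d, %d'), 2, [])';
    for ix = 1:size(scanned, 1)
        total = total + sum(scanned(ix, :));   % stand-in for the per-row work
        rowsSeen = rowsSeen + 1;
    end
    block = fread(fid, blockSize, 'uint8=>char')';
end
fclose(fid);
delete(fName);
```

Rewinding with a single fseek avoids the character-by-character fread loop, at the cost of re-reading a few bytes per block.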

It's a little bit tedious, but not that hard. I typically see 90% plus improvement in speed over single line readers.

(terrible idea using Java batched line readers removed in shame)
