Fastest Matlab file reading?


Question

      My MATLAB program is reading a file about 7 million lines long and wasting far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is deathly slow. What MATLAB function should I be using instead?

      Catch: I have to operate on the file one line at a time without storing the whole file in memory, so none of the commands that read entire matrices are on the table.

      fid = fopen('file.txt');
      tline = fgetl(fid);
      while ischar(tline)
          nums = str2num(tline);    
          %do stuff with nums
          tline = fgetl(fid);
      end
      fclose(fid);
      

      Solution

      Problem statement

      This is a common struggle, and there is nothing like a test to answer it. Here are my assumptions:

      1. A well formatted ASCII file, containing two columns of numbers. No headers, no inconsistent lines etc.

      2. The method must scale to reading files that are too large to be contained in memory, (although my patience is limited, so my test file is only 500,000 lines).

      3. The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time, cannot be vectorized.

      Discussion

      With that in mind, the answers and comments seem to be encouraging efficiency in three areas:

      • reading the file in larger batches
      • performing the string to number conversion more efficiently (either via batching, or using better functions)
      • making the actual processing more efficient (which I have ruled out via rule 3, above).

      Results

      I put together a quick script to test the ingestion speed (and consistency of result) of eight variations on these themes. The results are:

      • Initial code. 68.23 sec. 582582 check
      • Using sscanf, once per line. 27.20 sec. 582582 check
      • Using fscanf in large batches. 8.93 sec. 582582 check
      • Using textscan in large batches. 8.79 sec. 582582 check
      • Reading large batches into memory, then sscanf. 8.15 sec. 582582 check
      • Using java single line file reader and sscanf on single lines. 63.56 sec. 582582 check
      • Using java single item token scanner. 81.19 sec. 582582 check
      • Fully batched operations (non-compliant). 1.02 sec. 508680 check (violates rule 3)

      Summary

      More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, which can be removed by switching to sscanf.
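As a minimal illustration of that swap (the sample line below is made up, not taken from the benchmark file): str2num evaluates the text as a MATLAB expression, while sscanf parses it against a fixed format. Both recover the same two integers here, although str2num returns a row and sscanf returns a column.

```matlab
% Hypothetical sample line in the same "int, int " layout as the test file
tline = '123456, 789 ';

a = str2num(tline);            % evaluates the text; returns a 1x2 row vector
b = sscanf(tline, '%d, %d');   % parses against the format; returns a 2x1 column

% Same values, different orientation
sameValues = isequal(a(:), b(:));
```

Because the orientation differs, code that indexes nums by row or column may need a transpose after switching functions. mean(nums) gives the same result either way.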

      About another 2/3 of the remaining time (27 -> 8 sec) can be reduced by using larger batches for both file reading and string to number conversions.

      If we are willing to violate rule number three in the original post, another 7/8 of the time can be saved by switching to fully vectorized numeric processing. However, some algorithms do not lend themselves to this, so we leave it alone. (Note that the "check" value does not match for the last entry.)
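The mismatch in the last "check" value is expected, because the running CHECK recurrence depends on how the rows are grouped. A small sketch with made-up numbers (not the benchmark data) shows the row-by-row and whole-batch forms giving different results:

```matlab
% Hypothetical three-row data set, chosen so no rounding ties occur
rows = [10 22; 30 50; 2 6];

% Row-at-a-time recurrence, as in the compliant variants
checkRow = 0;
for ix = 1:size(rows, 1)
    checkRow = round((checkRow + mean(rows(ix, :))) / 2);
end

% Whole-batch form, as in the non-compliant vectorized variant
checkBatch = round((0 + mean(rows(:))) / 2);

fprintf('row-wise: %d, batched: %d\n', checkRow, checkBatch);
```

The row-wise value weights recent rows most heavily, while the batched value averages the whole batch first, so the two are only comparable within the same processing style.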

      Finally, in direct contradiction to a previous edit of mine within this response, no savings are available by switching to the available cached Java single-line readers. In fact, that solution is 2 to 3 times slower than the comparable single-line result using native readers (63 vs. 27 seconds).

      Sample code for all of the solutions described above is included below.


      Sample code

      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Create a test file
      cd(tempdir);
      fName = 'demo_file.txt';
      fid = fopen(fName,'w');
      for ixLoop = 1:5
          d = randi(1e6, 1e5,2);
          fprintf(fid, '%d, %d \n',d);
      end
      fclose(fid);
      
      
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Initial code
      CHECK = 0;
      tic;
      fid = fopen('demo_file.txt');
      tline = fgetl(fid);
      while ischar(tline)
          nums = str2num(tline);
          CHECK = round((CHECK + mean(nums) ) /2);
          tline = fgetl(fid);
      end
      fclose(fid);
      t = toc;
      fprintf(1,'Initial code.  %3.2f sec.  %d check \n', t, CHECK);
      
      
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Using sscanf, once per line
      CHECK = 0;
      tic;
      fid = fopen('demo_file.txt');
      tline = fgetl(fid);
      while ischar(tline)
          nums = sscanf(tline,'%d, %d');
          CHECK = round((CHECK + mean(nums) ) /2);
          tline = fgetl(fid);
      end
      fclose(fid);
      t = toc;
      fprintf(1,'Using sscanf, once per line.  %3.2f sec.  %d check \n', t, CHECK);
      
      
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Using fscanf in large batches
      CHECK = 0;
      tic;
      bufferSize = 1e4;
      fid = fopen('demo_file.txt');
      scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
      while ~isempty(scannedData)
          for ix = 1:size(scannedData,1)
              nums = scannedData(ix,:);
              CHECK = round((CHECK + mean(nums) ) /2);
          end
          scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
      end
      fclose(fid);
      t = toc;
      fprintf(1,'Using fscanf in large batches.  %3.2f sec.  %d check \n', t, CHECK);
      
      
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Using textscan in large batches
      CHECK = 0;
      tic;
      bufferSize = 1e4;
      fid = fopen('demo_file.txt');
      scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
      while ~isempty(scannedData{1})
          for ix = 1:size(scannedData{1},1)
              nums = [scannedData{1}(ix) scannedData{2}(ix)];
              CHECK = round((CHECK + mean(nums) ) /2);
          end
          scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
      end
      fclose(fid);
      t = toc;
      fprintf(1,'Using textscan in large batches.  %3.2f sec.  %d check \n', t, CHECK);
      
      
      
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Reading in large batches into memory, incrementing to end-of-line, sscanf
      CHECK = 0;
      tic;
      fid = fopen('demo_file.txt');
      bufferSize = 1e4;
      eol = sprintf('\n');
      
      dataBatch = fread(fid,bufferSize,'uint8=>char')';
      dataIncrement = fread(fid,1,'uint8=>char');
      while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
          dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
      end
      data = [dataBatch dataIncrement];
      
      while ~isempty(data)
          scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
          for ix = 1:size(scannedData,1)
              nums = scannedData(ix,:);
              CHECK = round((CHECK + mean(nums) ) /2);
          end
      
          dataBatch = fread(fid,bufferSize,'uint8=>char')';
          dataIncrement = fread(fid,1,'uint8=>char');
          while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
              dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
          end
          data = [dataBatch dataIncrement];
      end
      fclose(fid);
      t = toc;
      fprintf(1,'Reading large batches into memory, then sscanf.  %3.2f sec.  %d check \n', t, CHECK);
      
      
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Using Java single line readers + sscanf
      CHECK = 0;
      tic;
      bufferSize = 1e4;
      reader =  java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
      tline = char(reader.readLine());
      while ~isempty(tline)
          nums = sscanf(tline,'%d, %d');
          CHECK = round((CHECK + mean(nums) ) /2);
          tline = char(reader.readLine());
      end
      reader.close();
      t = toc;
      fprintf(1,'Using java single line file reader and sscanf on single lines.  %3.2f sec.  %d check \n', t, CHECK);
      
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Using Java scanner for file reading and string conversion
      CHECK = 0;
      tic;
      jFile = java.io.File('demo_file.txt');
      scanner = java.util.Scanner(jFile);
      scanner.useDelimiter('[\s\,\n\r]+');
      while scanner.hasNextInt()
          nums = [scanner.nextInt() scanner.nextInt()];
          CHECK = round((CHECK + mean(nums) ) /2);
      end
      scanner.close();
      t = toc;
      fprintf(1,'Using java single item token scanner.  %3.2f sec.  %d check \n', t, CHECK);
      
      
      %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
      %% Reading in large batches into memory, vectorized operations (non-compliant solution)
      CHECK = 0;
      tic;
      fid = fopen('demo_file.txt');
      bufferSize = 1e4;
      eol = sprintf('\n');
      
      dataBatch = fread(fid,bufferSize,'uint8=>char')';
      dataIncrement = fread(fid,1,'uint8=>char');
      while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
          dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
      end
      data = [dataBatch dataIncrement];
      
      while ~isempty(data)
          scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
          CHECK = round((CHECK + mean(scannedData(:)) ) /2);
      
          dataBatch = fread(fid,bufferSize,'uint8=>char')';
          dataIncrement = fread(fid,1,'uint8=>char');
          while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
              dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
          end
          data = [dataBatch dataIncrement];
      end
      fclose(fid);
      t = toc;
      fprintf(1,'Fully batched operations.  %3.2f sec.  %d check \n', t, CHECK);
      


      (original answer)

      To expand on the point made by Ben ... your bottleneck will always be file I/O if you are reading these files line by line.

      I understand that sometimes you cannot fit a whole file into memory. I typically read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system). Then I either read additional single characters (or back off single characters) to get a round number of lines, and then run your string parsing (e.g. sscanf).

      Then if you want you can process the resulting large matrix one row at a time, before repeating the process until you read the end of the file.

      It's a little bit tedious, but not that hard. I typically see 90% plus improvement in speed over single line readers.
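A compact sketch of that chunk-then-complete-the-line pattern follows. It is one possible variation, not the benchmarked code: it finishes a partial last line with a single fgetl call instead of reading one character at a time. The file name chunk_demo.txt and the tiny chunkSize are assumptions made for the demo, and the sketch assumes a chunk never ends in the middle of a line terminator.

```matlab
% Write a small demo file (hypothetical name) with four "int, int " lines
fName = fullfile(tempdir, 'chunk_demo.txt');
fid = fopen(fName, 'w');
fprintf(fid, '%d, %d \n', [1 2; 3 4; 5 6; 7 8]');
fclose(fid);

chunkSize = 10;   % deliberately tiny so chunks end mid-line; use 1e5 or more in practice
rowSum = 0;
nRows = 0;
fid = fopen(fName, 'r');
while ~feof(fid)
    chunk = fread(fid, chunkSize, 'uint8=>char')';
    if isempty(chunk)
        break;
    end
    % If the chunk stopped mid-line, read the rest of that line
    if chunk(end) ~= sprintf('\n')
        tail = fgetl(fid);
        if ischar(tail)
            chunk = [chunk tail];
        end
    end
    % One sscanf call converts every complete line in the chunk
    vals = reshape(sscanf(chunk, '%d, %d'), 2, [])';
    for ix = 1:size(vals, 1)
        nums = vals(ix, :);           % per-row processing goes here
        rowSum = rowSum + sum(nums);
        nRows = nRows + 1;
    end
end
fclose(fid);
```

Only one chunk plus one line is held in memory at a time, so the approach still scales to files that do not fit in memory.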


      (terrible idea using Java batched line readers removed in shame)
