What is the fastest way to parse a line in Delphi?


Problem description

I have a huge file that I must parse line by line. Speed is of the essence.

Example of a line:

Token-1   Here-is-the-Next-Token      Last-Token-on-Line
      ^                        ^
   Current                 Position
   Position              after GetToken

When GetToken is called, it returns "Here-is-the-Next-Token" and sets CurrentPosition to the position of the last character of that token, so that it is ready for the next call to GetToken. Tokens are separated by one or more spaces.

Assume the file is already in a StringList in memory. It fits in memory easily, say 200 MB.

I am worried only about the execution time for the parsing. What code will produce the absolute fastest execution in Delphi (Pascal)?

Solution

  • Use PChar incrementing for speed of processing
  • If some tokens are not needed, only copy token data on demand
  • Copy PChar to local variable when actually scanning through characters
  • Keep source data in a single buffer unless you must handle line by line, and even then, consider handling line processing as a separate token in the lexer recognizer
  • Consider processing a byte array buffer that has come straight from the file, if you definitely know the encoding; if using Delphi 2009 or later, where PChar is a Unicode PWideChar, use PAnsiChar instead of PChar, unless of course you know the encoding is UTF-16LE.
  • If you know that the only whitespace is going to be #32 (ASCII space), or a similarly limited set of characters, there may be some clever bit manipulation hacks that can let you process 4 bytes at a time using Integer scanning. I wouldn't expect big wins here though, and the code will be as clear as mud.
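
As a rough illustration of the second and fifth points, here is a minimal sketch that walks a single-byte buffer with PAnsiChar and only counts tokens, so no token text is ever copied. The name CountTokens and the sample data are assumptions made for this example, and the sketch assumes the buffer is null-terminated and in a single-byte encoding such as ASCII.

```pascal
program CountTokensDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper (not from the original answer): counts
// whitespace-separated tokens in a null-terminated single-byte buffer.
// Nothing is copied; the pointer is simply advanced over the data.
function CountTokens(P: PAnsiChar): Integer;
begin
  Result := 0;
  while True do
  begin
    // skip whitespace, i.e. anything in #1..#32
    while (P^ > #0) and (P^ <= #32) do
      Inc(P);

    // the null terminator marks the end of the data
    if P^ = #0 then
      Break;

    Inc(Result);

    // skip the token itself; #0 also stops this loop
    while P^ > #32 do
      Inc(P);
  end;
end;

var
  Data: AnsiString;
begin
  Data := 'Token-1   Here-is-the-Next-Token      Last-Token-on-Line';
  WriteLn(CountTokens(PAnsiChar(Data))); // prints 3
end.
```

If a token's text is needed after all, it can be copied at that point with SetString from the token's start pointer and length, which is what the lexer below does in GetCurrentToken.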

Here's a sample lexer that should be pretty efficient, but it assumes that all source data is in a single string. Reworking it to handle buffers is moderately tricky due to very long tokens.

type
  TLexer = class
  private
    FData: string;
    FTokenStart: PChar;
    FCurrPos: PChar;
    function GetCurrentToken: string;
  public
    constructor Create(const AData: string);
    function GetNextToken: Boolean;
    property CurrentToken: string read GetCurrentToken;
  end;

{ TLexer }

constructor TLexer.Create(const AData: string);
begin
  FData := AData;
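  FCurrPos := PChar(FData);
  // start with an empty token so CurrentToken is safe before the first GetNextToken
  FTokenStart := FCurrPos;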
end;

function TLexer.GetCurrentToken: string;
begin
  SetString(Result, FTokenStart, FCurrPos - FTokenStart);
end;

function TLexer.GetNextToken: Boolean;
var
  cp: PChar;
begin
  cp := FCurrPos; // copy to local to permit register allocation

  // skip whitespace; this test could be converted to an unsigned int
  // subtraction and compare for only a single branch
  while (cp^ > #0) and (cp^ <= #32) do
    Inc(cp);

  // using null terminator for end of file
  Result := cp^ <> #0;

  if Result then
  begin
    FTokenStart := cp;
    Inc(cp);
    while cp^ > #32 do
      Inc(cp);
  end;

  FCurrPos := cp;
end;
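
The comment in the whitespace loop mentions folding the two comparisons into one unsigned subtraction and compare. A small sketch of that idea follows. The function names are hypothetical, and whether the compiler really emits a single branch depends on the compiler and its settings, so it is worth measuring before relying on it.

```pascal
program SingleBranchDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}
{$R-,Q-} // range and overflow checks off, as the trick relies on wrap-around

// Two-comparison form, as used in the lexer's whitespace loop.
function IsSpaceTwoTests(C: AnsiChar): Boolean;
begin
  Result := (C > #0) and (C <= #32);
end;

// Single-comparison form: subtracting 1 maps #1..#32 onto 0..31,
// while #0 becomes -1, which reinterpreted as unsigned is far above 31.
function IsSpaceOneTest(C: AnsiChar): Boolean;
begin
  Result := Cardinal(Ord(C) - 1) < 32;
end;

var
  I, Mismatches: Integer;
begin
  // compare both forms over every possible single-byte character
  Mismatches := 0;
  for I := 0 to 255 do
    if IsSpaceTwoTests(AnsiChar(I)) <> IsSpaceOneTest(AnsiChar(I)) then
      Inc(Mismatches);
  WriteLn('mismatches: ', Mismatches); // prints "mismatches: 0"
end.
```

In GetNextToken the loop condition would then read `Cardinal(Ord(cp^) - 1) < 32` in place of the two comparisons.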
