AWK: Recursive Descent CSV Parser


Question


In response to a Recursive Descent CSV parser in BASH, I (the original author of both posts) have made the following attempt to translate it into an AWK script, to compare the speed of data processing with these scripting languages. The translation is not a 1:1 translation due to several mitigating factors, but for those who are interested, this implementation is faster at string processing than the other.


Originally we had a few questions that have all been quashed thanks to Jonathan Leffler. While the title says CSV, we've updated the code to DSV, which means you can specify any single character as a field delimiter should you find it necessary.


This code is now ready for showdown.

Basic Features

  • No imposed limitations on input length, field length, or field count
  • Literal Quoted Fields via double quote "
  • ANSI C Escape Sequences as defined here in section 1.1.2[1][2][3]
  • Custom Input Delimiter: The Art of UNIX Programming (DSV)[4]
  • Custom Output Delimiter[5]
  • UCS-2 and UCS-4 Escape Sequences[6]


[1]Quoted fields are literal content; therefore no escape sequence interpretation is performed on quoted content. One can, however, concatenate quotes, plain text and interpreted sequences in a single field to achieve the desired effect. For example:

one,two,three:\t"Little Endians," and one Big Endian Chief


is a three-field line of CSV where the third field is equivalent to:

three:        Little Endians, and one Big Endian Chief
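
Running that line through the parser directly shows the same thing; a minimal sketch, assuming the script below has been saved as dsv.awk (printf is used instead of echo so the shell itself does not interpret the backslash):

printf '%s\n' 'one,two,three:\t"Little Endians," and one Big Endian Chief' | awk -f dsv.awk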


[2]The examples described in the reference material as "implementation specific", or as possessing "undefined behavior", will not be supported, as they are not portable by definition or too ambiguous to be reliable. If an escape sequence is not defined here or in the reference material, the backslash will be ignored and the single following character will be treated as a plain-text value. Integer-value character escape sequences will not be supported; they are an unreliable method that does not scale well across multiple platforms and unnecessarily increases the complexity of parsing by proxy of validation.


[3]Octal character escapes must be in 3-digit octal format. If it is not a 3-digit octal escape sequence, it is treated as a single-digit null escape sequence. Hexadecimal escape sequences must be in 2-digit hexadecimal format. If the first two characters following the escape sequence identifier are not valid hexadecimal digits, no interpretation will take place and a message will be printed on standard error. Any remaining hexadecimal digits are ignored.
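
To illustrate footnotes [2] and [3], here is a hedged example, assuming the script is saved as dsv.awk and run under a gawk-compatible awk (the exact byte emitted by %c for octal and hex values can vary between awk implementations and locales):

# \x41 is a 2-digit hex escape (A), \0102 is a 3-digit octal escape (B),
# and \q is not a defined escape, so the backslash is dropped and 'q' kept:
printf '%s\n' 'id,\x41\0102\q' | awk -f dsv.awk
# expected output: two fields, "id" and "ABq", each on its own line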


[4]The custom input delimiter iDelimiter must be a single character. Multi-line records will not be supported, and usage of such a contradiction should always be frowned upon. This decreases the portability of a data record, making it specific to a file whose location and origin (within that file) may be unknown. For instance, grepping a file for content may return an incomplete record, because the content may begin on any previous line, limiting data acquisition to full top-down parsing of the database.
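
For example, a single-character colon delimiter on a passwd-style line (the data below is purely illustrative):

echo 'root:x:0:0:root:/root:/bin/sh' | awk -v iDelimiter=':' -f dsv.awk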


[5]The custom output delimiter oDelimiter may be any desirable string value. Script output is always terminated by a single newline. This is a feature of correct terminal application output; otherwise your parsed CSV output and the terminal prompt would share the same line, creating a confusing situation. Also, most interpreters, like consoles, are line-based devices that expect a newline to signal the end of an I/O record. If you find the trailing newline undesirable, trim it off.
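
If the trailing newline does get in your way, shell command substitution is one way to trim it; a minimal sketch:

# $(...) strips the final newline appended by the script:
fields=$(echo 'a,b,c' | awk -v oDelimiter=';' -f dsv.awk)
printf '[%s]\n' "$fields"    # prints [a;b;c]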


[6]16-bit Unicode escape sequences are available via the following notation:

 \uHHHH Unicode character with hex value HHHH (4 digits)


and 32-bit Unicode escape sequences are supported via:

 \UHHHHHHHH Unicode character with hex value HHHHHHHH (8 digits)
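
A hedged example of the Unicode notation (whether the code point renders as a single character depends on the awk build and locale; gawk in a UTF-8 locale handles it, a strictly byte-oriented awk may not):

# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) via the 4-digit form:
printf '%s\n' 'caf\u00E9,tea' | awk -f dsv.awk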




Special Thanks to all Members of the SO community whose experience, time and input led me to create such a wonderfully useful tool for information handling.

Code Listing: dsv.awk

#!/bin/awk -f
#
###############################################################
#
# ZERO LIABILITY OR WARRANTY LICENSE YOU MAY NOT OWN ANY
# COPYRIGHT TO THIS SOFTWARE OR DATA FORMAT IMPOSED HEREIN 
# THE AUTHOR PLACES IT IN THE PUBLIC DOMAIN FOR ALL USES 
# PUBLIC AND PRIVATE THE AUTHOR ASKS THAT YOU DO NOT REMOVE
# THE CREDIT OR LICENSE MATERIAL FROM THIS DOCUMENT.
#
###############################################################
#
# Special thanks to Jonathan Leffler, whose wisdom, and 
# knowledge defined the output logic of this script.
#
# Special thanks to GNU.org for the base conversion routines.
#
# Credits and recognition to the original Author:
# Triston J. Taylor whose countless hours of experience,
# research and rationalization have provided us with a
# more portable standard for parsing DSV records.
#
###############################################################
#
# This script accepts and parses a single line of DSV input
# from <STDIN>.
#
# Record fields are separated by the command line variable
# 'iDelimiter'; the default value is comma.
#
# Output is separated by the command line variable 'oDelimiter';
# the default value is line feed.
#
# To learn more about this tool visit StackOverflow.com:
#
# http://stackoverflow.com/questions/10578119/
#
# You will find there a wealth of information on its
# standards and development track.
#
###############################################################

function NextSymbol() {

    strIndex++;
    symbol = substr(input, strIndex, 1);

    return (strIndex < parseExtent);

}

function Accept(query) {

    #print "query: " query " symbol: " symbol
    if ( symbol == query ) {
        #print "matched!"        
        return NextSymbol();         
    }

    return 0;

}

function Expect(query) {

    # special case: empty query && symbol...
    if ( query == nothing && symbol == nothing ) return 1;

    # case: else
    if ( Accept(query) ) return 1;

    msg = "dsv parse error: expected '" query "': found '" symbol "'";
    print msg > "/dev/stderr";

    return 0;

}

function PushData() {

    field[fieldIndex++] = fieldData;
    fieldData = nothing;

}

function Quote() {

    while ( symbol != quote && symbol != nothing ) {
        fieldData = fieldData symbol;
        NextSymbol();
    }

    Expect(quote);

}

function GetOctalChar() {

    qOctalValue = substr(input, strIndex+1, 3);

    # This isn't really correct but its the only way
    # to express 0-255. On unicode systems it won't
    # matter anyway so we don't restrict the value
    # any further than length validation.

    if ( qOctalValue ~ /^[0-7]{3}$/ ) {

        # convert octal to decimal so we can print the
        # desired character in POSIX awks...

        n = length(qOctalValue)
        ret = 0
        for (i = 1; i <= n; i++) {
            c = substr(qOctalValue, i, 1)
            if ((k = index("01234567", c)) > 0)
            k-- # adjust for 1-basing in awk
            ret = ret * 8 + k
        }

        strIndex+=3;
        return sprintf("%c", ret);

        # and people ask why posix gets me all upset..
        # Special thanks to gnu.org for this contrib..

    }

    return sprintf("\0"); # if it wasn't 3 digit octal just use zero

}

function GetHexChar(qHexValue) {

    rHexValue = HexToDecimal(qHexValue);
    rHexLength = length(qHexValue);

    if ( rHexValue != nothing ) { # only a valid hex value is interpreted

        strIndex += rHexLength;
        return sprintf("%c", rHexValue);

    }

    # accept no non-sense!
    printf("dsv parse error: expected " rHexLength) > "/dev/stderr";
    printf("-digit hex value: found '" qHexValue "'\n") > "/dev/stderr";

}

function HexToDecimal(hexValue) {

    if ( hexValue ~ /^[[:xdigit:]]+$/ ) {

        # convert hex to decimal so we can print the
        # desired character in POSIX awks...

        n = length(hexValue)
        ret = 0
        for (i = 1; i <= n; i++) {

            c = substr(hexValue, i, 1)
            c = tolower(c)

            if ((k = index("0123456789", c)) > 0)
                k-- # adjust for 1-basing in awk
            else if ((k = index("abcdef", c)) > 0)
                k += 9

            ret = ret * 16 + k
        }

        return ret;

        # and people ask why posix gets me all upset..
        # Special thanks to gnu.org for this contrib..

    }

    return nothing;

}

function BackSlash() {

    # This could be optimized with some constants.
    # but we generate the data here to assist in
    # translation to other programming languages.

    if (symbol == iDelimiter) { # separator precedes all sequences
        fieldData = fieldData symbol;
    } else if (symbol == "a") { # alert
        fieldData = sprintf("%s\a", fieldData);
    } else if (symbol == "b") { # backspace
        fieldData = sprintf("%s\b", fieldData);
    } else if (symbol == "f") { # form feed
        fieldData = sprintf("%s\f", fieldData);
    } else if (symbol == "n") { # line feed
        fieldData = sprintf("%s\n", fieldData);
    } else if (symbol == "r") { # carriage return
        fieldData = sprintf("%s\r", fieldData);
    } else if (symbol == "t") { # horizontal tab
        fieldData = sprintf("%s\t", fieldData);
    } else if (symbol == "v") { # vertical tab
        fieldData = sprintf("%s\v", fieldData);
    } else if (symbol == "0") { # null or 3-digit octal character
        fieldData = fieldData GetOctalChar();
    } else if (symbol == "x") { # 2-digit hexadecimal character 
        fieldData = fieldData GetHexChar( substr(input, strIndex+1, 2) );
    } else if (symbol == "u") { # 4-digit hexadecimal character 
        fieldData = fieldData GetHexChar( substr(input, strIndex+1, 4) );
    } else if (symbol == "U") { # 8-digit hexadecimal character 
        fieldData = fieldData GetHexChar( substr(input, strIndex+1, 8) );
    } else { # symbol didn't match the "interpreted escape scheme"
        fieldData = fieldData symbol; # just concatenate the symbol
    }

    NextSymbol();

}

function Line() {

    if ( Accept(quote) ) {
        Quote();
        Line();
    }

    if ( Accept(backslash) ) {
        BackSlash();
        Line();        
    }

    if ( Accept(iDelimiter) ) {
        PushData();
        Line();
    }

    if ( symbol != nothing ) {
        fieldData = fieldData symbol;
        NextSymbol();
        Line();
    } else if ( fieldData != nothing ) PushData();

}

BEGIN {

    # State Variables
    symbol = ""; fieldData = ""; strIndex = 0; fieldIndex = 0;

    # Output Variables
    field[itemIndex] = "";

    # Control Variables
    parseExtent = 0;

    # Formatting Variables (optionally set on invocation line)
    if ( iDelimiter != "" ) {
        # the algorithm in place does not support multi-character delimiter
        if ( length(iDelimiter) > 1 ) { # we have a problem
            msg = "dsv parse: init error: multi-character delimiter detected:";
            printf("%s '%s'", msg, iDelimiter);
            exit 1;
        }
    } else {
        iDelimiter = ",";
    }
    if ( oDelimiter == "" ) oDelimiter = "\n";

    # Symbol Classes
    nothing = "";
    quote = "\"";
    backslash = "\\";

    getline input;

    parseExtent = (length(input) + 2);

    # parseExtent exceeds length because the loop would terminate
    # before parsing was complete otherwise.

    NextSymbol();
    Line();
    Expect(nothing);

}

END {

    if (fieldIndex) {

        fieldIndex--;

        for (i = 0; i < fieldIndex; i++)
        {
             printf("%s", field[i] oDelimiter);
        }

        print field[i];

    } 

}


How to run the script "like a pro"

# Spit out some CSV "newline" delimited:
echo 'one,two,three,AWK,CSV!' | awk -f dsv.awk

# Spit out some CSV "tab" delimited:
echo 'one,two,three,AWK,CSV!' | awk -v oDelimiter=$'\t' -f dsv.awk

# Spit out some CSV "ASCII Group Separator" delimited:
echo 'one,two,three,AWK,CSV!' | awk -v oDelimiter=$'\x1d' -f dsv.awk


If you need some custom output control separators but aren't sure what to use, you may consult this handy ASCII chart.

Future Plans:

  • C library Implementation
  • C Console Application Implementation
  • Submission to The Internet Engineering Task Force for Possible Standardization

Philosophy


Escape sequences should always be used to create multi-line field data in a line-based database, and quoting should always be used to preserve and concatenate record field content. This is the simplest (and therefore most efficient) way to implement a record parser of this type. I encourage all software developers and educational institutions to take up and profess this direction to ensure portability and exact acquisition of line-based, delimiter-separated records.
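
For example, a multi-line field can travel on a single physical record line by using the \n escape instead of a literal line break; a minimal sketch using the script above:

# one physical input line; the second field expands to two lines of text:
printf '%s\n' 'note,line one\nline two' | awk -f dsv.awk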


CSV has no official specification other than RFC 4180, and it does not define any useful portable record types. It is my hope, as a developer with over 15 years of experience, that this will become the officially recognized standard for Portable CSV/DSV Records.

Answer


There were way too many blank lines in the original version of the code, which made it hard to read. The revised code with reduced blank lines is much more easily read; related lines are in blocks that can be read together. Thanks.


awk is like C; it treats 0 as false and anything non-zero as true. So, anything greater than 0 is true, but so is anything less than 0.
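
A quick way to see that behaviour for yourself with any POSIX awk:

awk 'BEGIN { if (1)  print "1 is true"
             if (-1) print "-1 is true as well"
             if (0)  print "this line is never printed" }'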


There isn't a direct way to print to stderr in standard awk. GNU AWK documents the use of print "message" > "/dev/stderr" (name as string!) and implies that it might work even on systems without the actual device. It will work with standard awk too on systems with the /dev/stderr device.
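
The Expect() function in the script relies on exactly that form; a standalone sketch of the idiom:

# "/dev/stderr" is just a file name as far as awk is concerned:
echo 'hello' | awk '{ print "diagnostic: " $0 > "/dev/stderr"; print $0 }'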


The awk idiom for processing each index in an array is for (i in array) { ... }. However, since you have an index, itmIndex, telling you how many items are in the array, you should use

for (i = 0; i < itmIndex; i++) { printf("%s%s", item[i], delim); }


and then output a newline at the end. That gets one delimiter too many to my way of thinking, but that's a transcription of what the bash code is doing. My usual trick for this is:

pad = ""
for (i = 0; i < itmIndex; i++)
{
     printf("%s%s", pad, item[i])
     pad = delim
}
print "";


You can pass variables into the script with -v var=value (or omit the -v). See the POSIX URL listed before.
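
Both delimiters of dsv.awk are set that way; for instance (semicolon in, tab out, chosen purely for illustration):

echo 'one;two;three' | awk -v iDelimiter=';' -v oDelimiter=$'\t' -f dsv.awk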
