How to parse CSV file in the most performant way?


Problem Description


    I would like to parse big CSV files in ABAP in the most performant way under the following conditions:

    1. We do not know the structure of the CSV -> the parse result should be a table of string_table or something similar
    2. The parsing should happen in accordance with https://www.rfc-editor.org/rfc/rfc4180
    3. No solution-specific calls
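
    The tricky part of condition #2 is RFC 4180's quoting: a separator, an escaped quote (two double quotes), or even a line break may appear inside a quoted field. As a behavioral reference (Python rather than ABAP, purely for illustration), the stdlib csv module implements exactly these rules:

```python
import csv
import io

# One record with the three cases RFC 4180 quoting has to support:
# a separator inside quotes, an escaped quote (""), and an embedded CRLF.
data = '1234,"56,78","he said ""hi""","line1\r\nline2"\r\n'

# StringIO performs no newline translation here, so the embedded CRLF
# survives intact inside the quoted field.
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['1234', '56,78', 'he said "hi"', 'line1\r\nline2']]
```

    Any ABAP parser claiming RFC 4180 compliance has to reproduce this output: one row, four fields, with the quotes unescaped and the line break preserved.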

    I found a very nice blog https://blogs.sap.com/2014/09/09/understanding-csv-files-and-their-handling-in-abap/ but it has its shortcomings:

    1. Write your own code - The code example is not sufficient
    2. Read the file using KCD_CSV_FILE_TO_INTERN_CONVERT - solution specific (not available everywhere) and will dump on fields that are big enough
    3. Use RTTI and dynamic programming along with FM RSDS_CONVERT_CSV - we do not know the structure in advance
    4. Use class CL_RSDA_CSV_CONVERTER - we do not know the structure in advance

    I also checked the first available solution on github - https://github.com/thedoginthewok/ZwdCSV . Unfortunately, it has macros in the code (absolutely unacceptable) and also requires you to know the structure in advance.

    I also tried using regex to do the job, but on big files this is too slow.

    Even though I am extremely annoyed by this fact, I had to create a solution myself (I cannot believe that I actually did it - it should be in the standard...). My first solution was a direct copy-paste of Java code into ABAP (https://mkyong.com/java/how-to-read-and-parse-csv-file-in-java/). Unfortunately, as my other question How to iterate over string characters in ABAP in performant way? showed, it is not as easy to iterate over a string in ABAP as it is in Java. I then tried a split/count approach, and so far it has the best performance. Does anyone know a better way to achieve this?

    REPORT z_csv_test.
    
    
    CLASS lcl_csv_parser DEFINITION CREATE PRIVATE.
    
      PUBLIC SECTION.
        TYPES:
                 tt_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
        CLASS-METHODS:
          create
            IMPORTING
              !iv_delimiter      TYPE string DEFAULT  '"'
              !iv_separator      TYPE string DEFAULT  ','
              !iv_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf
            RETURNING
              VALUE(r_result)    TYPE REF TO lcl_csv_parser.
        METHODS:
          parse
            IMPORTING
              iv_string       TYPE string
            RETURNING
              VALUE(r_result) TYPE tt_string_matrix,
          constructor
            IMPORTING
              !iv_delimiter      TYPE string
              !iv_separator      TYPE string
              !iv_line_separator TYPE string.
      PROTECTED SECTION.
      PRIVATE SECTION.
        DATA:
          gv_delimiter         TYPE string,
          gv_separator         TYPE string,
          gv_line_separator    TYPE string,
          gv_escaped_delimiter TYPE string.
        METHODS parse_line_to_string_table
          IMPORTING
            iv_line         TYPE string
          RETURNING
            VALUE(r_result) TYPE string_table.
    ENDCLASS.
    
    CLASS lcl_csv_parser IMPLEMENTATION.
    
      METHOD create.
        r_result = NEW #(
          iv_delimiter      = iv_delimiter
          iv_line_separator = CONV #( iv_line_separator )
          iv_separator      = iv_separator  ).
      ENDMETHOD.
    
      METHOD constructor.
        me->gv_delimiter = iv_delimiter.
        me->gv_separator = iv_separator.
        me->gv_line_separator = iv_line_separator.
        me->gv_escaped_delimiter = |{ iv_delimiter }{ iv_delimiter }|.
      ENDMETHOD.
    
      METHOD parse.
        "get the lines
        SPLIT iv_string AT me->gv_line_separator INTO TABLE DATA(lt_lines).
        DATA lx_open_line TYPE abap_bool VALUE abap_false.
        DATA lv_current_line TYPE string.
    
        LOOP AT lt_lines ASSIGNING FIELD-SYMBOL(<ls_line>).
    
          FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_line> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
          IF ( lv_count MOD 2 )  = 1.
            IF lx_open_line = abap_true.
              lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
              lx_open_line = abap_false.
              APPEND parse_line_to_string_table( lv_current_line ) TO r_result.
            ELSE.
              lv_current_line = <ls_line>.
              lx_open_line = abap_true.
            ENDIF.
          ELSE.
            IF lx_open_line = abap_true.
              lv_current_line = |{ lv_current_line }{ me->gv_line_separator }{ <ls_line> }|.
            ELSE.
              APPEND parse_line_to_string_table( <ls_line> ) TO r_result.
            ENDIF.
    
          ENDIF.
        ENDLOOP.
    
      ENDMETHOD.
    
    
      METHOD parse_line_to_string_table.
        SPLIT iv_line AT me->gv_separator INTO TABLE DATA(lt_line).
        DATA lx_open_field TYPE abap_bool VALUE abap_false.
        DATA lv_current_field TYPE string.
        LOOP AT lt_line ASSIGNING FIELD-SYMBOL(<ls_field>).
          FIND ALL OCCURRENCES OF me->gv_delimiter IN <ls_field> IN CHARACTER MODE MATCH COUNT DATA(lv_count).
          IF ( lv_count MOD 2 ) = 1.
            IF lx_open_field = abap_true.
              lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
              lx_open_field = abap_false.
              APPEND lv_current_field TO r_result.
            ELSE.
              lv_current_field = <ls_field>.
              lx_open_field = abap_true.
            ENDIF.
          ELSE.
            IF lx_open_field = abap_true.
              lv_current_field = |{ lv_current_field }{ me->gv_separator }{ <ls_field> }|.
            ELSE.
              APPEND <ls_field> TO r_result.
            ENDIF.
          ENDIF.
    
        ENDLOOP.
    
        REPLACE ALL OCCURRENCES OF me->gv_escaped_delimiter IN TABLE r_result WITH me->gv_delimiter.
    
      ENDMETHOD.
    
    ENDCLASS.
    CLASS lcl_test_csv_parser DEFINITION
      FINAL
      CREATE PUBLIC .
    
      PUBLIC SECTION.
        CLASS-METHODS run.
        CLASS-METHODS get_file
          RETURNING VALUE(r_result) TYPE string.
    
    
    
      PROTECTED SECTION.
      PRIVATE SECTION.
    
    ENDCLASS.
    
    
    
    CLASS lcl_test_csv_parser IMPLEMENTATION.
    
      METHOD get_file.
        DATA lv_file_line TYPE string.
        DO 10 TIMES.
          lv_file_line = |"1234,{ cl_abap_char_utilities=>cr_lf }567890",{ lv_file_line }|.
        ENDDO.
        lv_file_line = lv_file_line && cl_abap_char_utilities=>cr_lf.
    
        DATA(lt_file_as_table) = VALUE string_table(
            FOR i = 1 THEN  i + 1 UNTIL  i = 1000000
                ( lv_file_line ) ).
    
        CONCATENATE LINES OF lt_file_as_table INTO r_result.
      ENDMETHOD.
    
    
    
    
    
      METHOD run.
        DATA lv_prepare_start TYPE timestampl.
        GET TIME STAMP FIELD lv_prepare_start.
    
        DATA(lv_file) = get_file(  ).
    
        DATA lv_prepare_end TYPE timestampl.
        GET TIME STAMP FIELD lv_prepare_end.
    
        WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = lv_prepare_end tstmp2 = lv_prepare_start ) }|.
    
        DATA lv_parse_start TYPE timestampl.
        GET TIME STAMP FIELD lv_parse_start.
        DATA(lo_parser) = lcl_csv_parser=>create(  ).
        DATA(lt_file) = lo_parser->parse( lv_file  ).
        DATA lv_parse_end TYPE timestampl.
        GET TIME STAMP FIELD lv_parse_end.
    
        WRITE |Parse took { cl_abap_tstmp=>subtract( tstmp1 = lv_parse_end tstmp2 = lv_parse_start ) }|.
    
    
      ENDMETHOD.
    
    
    
    ENDCLASS.
    
    START-OF-SELECTION.
      lcl_test_csv_parser=>run( ).
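
    Translated out of ABAP purely for illustration, the split/count idea in the report above looks like this in Python (a sketch mirroring the report's semantics, including that enclosing quotes are kept in the parsed fields):

```python
SEP, QUOTE, EOL = ',', '"', '\r\n'

def parse_line(line):
    """Mirror of parse_line_to_string_table: SPLIT at the separator,
    then re-join fragments whose quote count is odd (they were split
    inside a quoted field); finally unescape doubled quotes.
    Like the ABAP original, enclosing quotes are left in place."""
    fields, current, open_field = [], '', False
    for frag in line.split(SEP):
        if frag.count(QUOTE) % 2 == 1:
            if open_field:
                fields.append(current + SEP + frag)
                open_field = False
            else:
                current, open_field = frag, True
        elif open_field:
            current += SEP + frag
        else:
            fields.append(frag)
    return [f.replace(QUOTE * 2, QUOTE) for f in fields]

def parse(text):
    """Mirror of parse: SPLIT at CRLF, then re-join lines whose quote
    count is odd (a quoted field contained a line break)."""
    rows, current, open_line = [], '', False
    for line in text.rstrip('\r\n').split(EOL):
        if line.count(QUOTE) % 2 == 1:
            if open_line:
                rows.append(parse_line(current + EOL + line))
                open_line = False
            else:
                current, open_line = line, True
        elif open_line:
            current += EOL + line
        else:
            rows.append(parse_line(line))
    return rows

print(parse('a,"b,""c""",d\r\n"x\r\ny",z\r\n'))
# [['a', '"b,"c""', 'd'], ['"x\r\ny"', 'z']]
```

    Note the quirk visible in the output: doubled quotes are unescaped by the final replace, but the enclosing quotes of a quoted field are not stripped.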
    

    Solution

    I'd like to present a different approach that uses find heavily. Compared to your line-based approach it seems to have equivalent performance for unquoted fields, but performs slightly better when quoted fields are present:

    In general, this uses the pattern position = find( off = position + 1 ) to iterate over the string in chunks, and then uses substring to copy ranges into strings. What can be observed here is that in a loop that iterates a million times, every nanosecond saved has an impact on performance, and by moving as much work as possible out of the inner loop one can increase performance significantly. For the "simple" case of 10-digit fields both algorithms perform equally well, but for "longer" 30-digit fields your algorithm gets comparatively faster. For fields with quotes, the scan & concat approach I've used seems to be faster than the "reconstruct" approach. I guess that although one can achieve small gains through more clever ABAP, further significant optimizations are only possible by utilizing the engine even more.

    Anyway, here's the algorithm:

    CLASS lcl_csv_parser_find IMPLEMENTATION.
      METHOD parse.
        DATA line TYPE string_table.
        DATA position TYPE i.
        DATA(string_length) = strlen( i_string ).
    
        " Dereferencing member fields is slightly slower than variable access, in a close loop this matters
        DATA(separators) = me->separators.
        DATA(delimiter)  = me->delimiter.
    
        CHECK string_length <> 0.
    
        " Checking for delimiters in the DO loop is quite slow. 
        " By scanning the whole file once and skipping that check if no delimiter is present
        " This lead to a slight performance increase of 1s for 1 million rows
        DATA(next_delimiter) = find( val = i_string sub = delimiter ).
    
        DO.
          DATA(start_position) = position.
          DATA(field) = ``.
          " Check if field is enclosed in double quotes, as we need to unescape then
          IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
             start_position = start_position + 1. " literal starts after opening quote
    
             DO.
                position = find( val = i_string off = position + 1 sub = delimiter ).
                " literal must be closed
                " ASSERT position <> -1.
    
                DATA(subliteral_length) = position - start_position.
                field = field && substring( val = i_string off = start_position len = subliteral_length ).
    
                DATA(following_position) = position + 1.
                IF position = string_length OR i_string+following_position(1) <> delimiter.
                  " End of literal is reached
                  position = position + 1. " skip closing quote
                  EXIT. " DO
                ELSE.
                  " Found escape quote instead
                  position = following_position + 1.
                  field = field && me->delimiter.
                  " continue searching
                ENDIF.
    
                " ASSERT sy-index < 1000.
             ENDDO.
          ELSE.
            " Unescaped field, simply find the ending comma or newline
            position = find_any_of( val = i_string off = position + 1 sub = separators ).
    
            IF position = -1.
              position = string_length.
            ENDIF.
    
            field = substring( val = i_string off = start_position len = position - start_position ).
          ENDIF.
    
          APPEND field TO line.
    
    
          " Check if line ended and new line is started
          DATA(current) = substring( val = i_string off = position len = 2 ).
          IF current = me->line_separator.
           APPEND line TO r_result.
           CLEAR line.
           position = position + 2. " skip newline
          ELSE.
            " ASSERT i_string+position(1) = me->separator.
            position = position + 1.
          ENDIF.
    
    
          " Check if file ended
          IF position >= string_length.
            RETURN.
          ENDIF.
    
          " ASSERT sy-index < 100000001.
        ENDDO.
    
      ENDMETHOD.
    ENDCLASS.
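
    For comparison outside ABAP, the same scan-and-concat control flow can be sketched in Python, with str.find standing in for find( )/find_any_of( ) and slicing for substring( ). This is a sketch, not the author's code; like the ABAP version it expects the input to end with a line separator:

```python
def parse_find(text, sep=',', quote='"', eol='\r\n'):
    """Iterate the string in chunks via find(), copying ranges out with
    slices. Expects the input to end with eol, like the ABAP version."""
    rows, row = [], []
    pos, length = 0, len(text)
    if length == 0:
        return rows
    while True:
        start = pos
        if text[pos] == quote:
            # Quoted field: scan for the closing quote, unescaping "" -> "
            start += 1
            field = ''
            while True:
                pos = text.find(quote, pos + 1)
                field += text[start:pos]
                if pos + 1 >= length or text[pos + 1] != quote:
                    pos += 1          # skip the closing quote
                    break
                field += quote        # found an escaped quote, keep scanning
                pos += 1
                start = pos + 1
        else:
            # Unquoted field: find the ending separator or newline
            hits = [p for p in (text.find(sep, pos), text.find(eol, pos)) if p != -1]
            pos = min(hits) if hits else length
            field = text[start:pos]
        row.append(field)
        if text[pos:pos + len(eol)] == eol:   # line ended
            rows.append(row)
            row = []
            pos += len(eol)
        else:                                  # field separator
            pos += 1
        if pos >= length:
            return rows

print(parse_find('1234,"56,78",ab\r\n"x""y"\r\n'))
# [['1234', '56,78', 'ab'], ['x"y']]
```

    Unlike the line-based parser, this variant strips the enclosing quotes while copying the quoted content out, so no REPLACE pass over the result is needed.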
    


    As a sidenote, instead of creating a huge table of string fields as stated in #1, I would experiment with some kind of "visitor pattern", e.g. pass an instance of such an interface to the parser:

    INTERFACE if_csv_visitor.
      METHODS begin_line.
      METHODS end_line.
      METHODS visit_field
        IMPORTING
          i_field TYPE string.
    ENDINTERFACE.
    

    In a lot of cases you'll write the CSV fields into a structure anyway, and thus one can save allocating this quite large table.
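
    A minimal sketch of that visitor idea (Python for brevity; the class and column names are illustrative, not part of the original): the parser pushes events per field instead of materializing the full matrix, and a consumer fills its target structure directly.

```python
class CsvVisitor:
    """Counterpart of if_csv_visitor: the parser calls these hooks
    instead of building a table of tables."""
    def begin_line(self): ...
    def visit_field(self, field: str): ...
    def end_line(self): ...

class StructFiller(CsvVisitor):
    """Example consumer: maps each line's fields onto a fixed,
    assumed column list."""
    def __init__(self, columns):
        self.columns = columns
        self.records = []
    def begin_line(self):
        self._fields = []
    def visit_field(self, field):
        self._fields.append(field)
    def end_line(self):
        self.records.append(dict(zip(self.columns, self._fields)))

def parse_with_visitor(text, visitor, sep=',', eol='\r\n'):
    # Simplified driver (no quote handling) just to show the call protocol.
    for line in text.rstrip(eol).split(eol):
        visitor.begin_line()
        for field in line.split(sep):
            visitor.visit_field(field)
        visitor.end_line()

v = StructFiller(['id', 'name'])
parse_with_visitor('1,Alice\r\n2,Bob\r\n', v)
print(v.records)
# [{'id': '1', 'name': 'Alice'}, {'id': '2', 'name': 'Bob'}]
```

    The parser stays generic (condition #1 still holds: it never needs to know the structure), while the consumer decides whether to build a matrix, a typed structure, or nothing at all.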


    And for further reference, here's the whole report:

    *&---------------------------------------------------------------------*
    *& Report Z_CSV
    *&---------------------------------------------------------------------*
    *&
    *&---------------------------------------------------------------------*
    REPORT Z_CSV.
    
    * --------------------- Generic CSV Parser ----------------------------*
    
    CLASS lcl_csv_parser DEFINITION ABSTRACT.
    
      PUBLIC SECTION.
        TYPES:
          t_string_matrix TYPE STANDARD TABLE OF string_table WITH EMPTY KEY.
    
        METHODS:
          parse ABSTRACT
            IMPORTING
              i_string       TYPE string
            RETURNING
              VALUE(r_result) TYPE t_string_matrix,
          constructor
            IMPORTING
              i_delimiter      TYPE string DEFAULT  '"'
              i_separator      TYPE string DEFAULT  ','
              i_line_separator TYPE abap_cr_lf DEFAULT cl_abap_char_utilities=>cr_lf.
    
      PROTECTED SECTION.
        DATA:
          delimiter         TYPE string,
          separator         TYPE string,
          line_separator    TYPE string,
          escaped_delimiter TYPE string,
          separators        TYPE string.
    
    ENDCLASS.
    
    CLASS lcl_csv_parser IMPLEMENTATION.
      METHOD constructor.
        me->delimiter = i_delimiter.
        me->separator = i_separator.
        me->line_separator = i_line_separator.
        me->escaped_delimiter = |{ i_delimiter }{ i_delimiter }|.
        me->separators = i_separator && i_line_separator.
      ENDMETHOD.
    ENDCLASS.
    
    
    * --------------------------- Line based CSV Parser ------------------------ *
    
    CLASS lcl_csv_parser_line DEFINITION INHERITING FROM lcl_csv_parser.
      PUBLIC SECTION.
        METHODS parse REDEFINITION.
    
      PRIVATE SECTION.
        METHODS parse_line_to_string_table
          IMPORTING
            i_line         TYPE string
          RETURNING
            VALUE(r_result) TYPE string_table.
    ENDCLASS.
    
    
    CLASS lcl_csv_parser_line IMPLEMENTATION.
      METHOD parse.
        "get the lines
        SPLIT i_string AT me->line_separator INTO TABLE DATA(lines).
        DATA open_line TYPE abap_bool VALUE abap_false.
        DATA current_line TYPE string.
    
        LOOP AT lines ASSIGNING FIELD-SYMBOL(<line>).
    
          FIND ALL OCCURRENCES OF me->delimiter IN <line> IN CHARACTER MODE MATCH COUNT DATA(count).
          IF ( count MOD 2 )  = 1.
            IF open_line = abap_true.
              current_line = |{ current_line }{ me->line_separator }{ <line> }|.
              open_line = abap_false.
              APPEND parse_line_to_string_table( current_line ) TO r_result.
            ELSE.
              current_line = <line>.
              open_line = abap_true.
            ENDIF.
          ELSE.
            IF open_line = abap_true.
              current_line = |{ current_line }{ me->line_separator }{ <line> }|.
            ELSE.
              APPEND parse_line_to_string_table( <line> ) TO r_result.
            ENDIF.
    
          ENDIF.
        ENDLOOP.
    
      ENDMETHOD.
    
    
      METHOD parse_line_to_string_table.
        SPLIT i_line AT me->separator INTO TABLE DATA(fields).
    
        DATA open_field TYPE abap_bool VALUE abap_false.
        DATA current_field TYPE string.
    
        LOOP AT fields ASSIGNING FIELD-SYMBOL(<field>).
          FIND ALL OCCURRENCES OF me->delimiter IN <field> IN CHARACTER MODE MATCH COUNT DATA(count).
          IF ( count MOD 2 ) = 1.
            IF open_field = abap_true.
              current_field = |{ current_field }{ me->separator }{ <field> }|.
              open_field = abap_false.
              APPEND current_field TO r_result.
            ELSE.
              current_field = <field>.
              open_field = abap_true.
            ENDIF.
          ELSE.
            IF open_field = abap_true.
              current_field = |{ current_field }{ me->separator }{ <field> }|.
            ELSE.
              APPEND <field> TO r_result.
            ENDIF.
          ENDIF.
    
        ENDLOOP.
    
        REPLACE ALL OCCURRENCES OF me->escaped_delimiter IN TABLE r_result WITH me->delimiter.
    
      ENDMETHOD.
    
    ENDCLASS.
    
    *--------------- Find based CSV Parser ------------------------------------*
    
    CLASS lcl_csv_parser_find DEFINITION INHERITING FROM lcl_csv_parser.
      PUBLIC SECTION.
        METHODS parse REDEFINITION.
    
    ENDCLASS.
    
    CLASS lcl_csv_parser_find IMPLEMENTATION.
      METHOD parse.
        DATA line TYPE string_table.
        DATA position TYPE i.
        DATA(string_length) = strlen( i_string ).
    
        " Dereferencing member fields is slightly slower than variable access, in a close loop this matters
        DATA(separators) = me->separators.
        DATA(delimiter)  = me->delimiter.
    
        CHECK string_length <> 0.
    
        " Checking for delimiters in the DO loop is quite slow.
        " By scanning the whole file once and skipping that check if no delimiter is present
        " This lead to a slight performance increase of 1s for 1 million rows
        DATA(next_delimiter) = find( val = i_string sub = delimiter ).
    
        DO.
          DATA(start_position) = position.
          DATA(field) = ``.
          " Check if field is enclosed in double quotes, as we need to unescape then
          IF next_delimiter <> -1 AND i_string+position(1) = delimiter.
             start_position = start_position + 1. " literal starts after opening quote
    
             DO.
                position = find( val = i_string off = position + 1 sub = delimiter ).
                " literal must be closed
                " ASSERT position <> -1.
    
                DATA(subliteral_length) = position - start_position.
                field = field && substring( val = i_string off = start_position len = subliteral_length ).
    
                DATA(following_position) = position + 1.
                IF position = string_length OR i_string+following_position(1) <> delimiter.
                  " End of literal is reached
                  position = position + 1. " skip closing quote
                  EXIT. " DO
                ELSE.
                  " Found escape quote instead
                  position = following_position + 1.
                  field = field && me->delimiter.
                  " continue searching
                ENDIF.
    
                " ASSERT sy-index < 1000.
             ENDDO.
          ELSE.
            " Unescaped field, simply find the ending comma or newline
            position = find_any_of( val = i_string off = position + 1 sub = separators ).
    
            IF position = -1.
              position = string_length.
            ENDIF.
    
            field = substring( val = i_string off = start_position len = position - start_position ).
          ENDIF.
    
          APPEND field TO line.
    
    
          " Check if line ended and new line is started
          DATA(current) = substring( val = i_string off = position len = 2 ).
          IF current = me->line_separator.
           APPEND line TO r_result.
           CLEAR line.
           position = position + 2. " skip newline
          ELSE.
            " ASSERT i_string+position(1) = me->separator.
            position = position + 1.
          ENDIF.
    
    
          " Check if file ended
          IF position >= string_length.
            RETURN.
          ENDIF.
    
          " ASSERT sy-index < 100000001.
        ENDDO.
    
      ENDMETHOD.
    ENDCLASS.
    
    * -------------------- Tests -------------------------------------------------------- *
    
    CLASS lcl_test_csv_parser DEFINITION
      FINAL
      CREATE PUBLIC .
    
      PUBLIC SECTION.
        CLASS-METHODS run.
        CLASS-METHODS get_file_complex
          RETURNING VALUE(r_result) TYPE string.
        CLASS-METHODS get_file_simple
          RETURNING VALUE(r_result) TYPE string.
        CLASS-METHODS get_file_long
          RETURNING VALUE(r_result) TYPE string.
        CLASS-METHODS get_file_longer
          RETURNING VALUE(r_result) TYPE string.
        CLASS-METHODS get_file_mixed
          RETURNING VALUE(r_result) TYPE string.
    
    
    
      PROTECTED SECTION.
      PRIVATE SECTION.
    
    ENDCLASS.
    
    
    
    CLASS lcl_test_csv_parser IMPLEMENTATION.
    
      METHOD get_file_complex.
        DATA(file_line) =
          repeat( val = |"1234,{ cl_abap_char_utilities=>cr_lf }7890",| occ = 9 ) &&
          |"1234,{ cl_abap_char_utilities=>cr_lf }7890"| &&
          cl_abap_char_utilities=>cr_lf.
    
        r_result = repeat( val = file_line occ = 1000000 ).
      ENDMETHOD.
    
      METHOD get_file_simple.
        DATA(file_line) =
          repeat( val = |1234567890,| occ = 9 ) &&
          |1234567890| &&
          cl_abap_char_utilities=>cr_lf.
    
        r_result = repeat( val = file_line occ = 1000000 ).
      ENDMETHOD.
    
      METHOD get_file_long.
        DATA(file_line) =
          repeat( val = |12345678901234567890,| occ = 4 ) &&
          |12345678901234567890| &&
          cl_abap_char_utilities=>cr_lf.
    
        r_result = repeat( val = file_line occ = 1000000 ).
      ENDMETHOD.
    
      METHOD get_file_longer.
        DATA(file_line) =
          repeat( val = |1234567890123456789012345678901234567890,| occ = 2 ) &&
          |1234567890123456789012345678901234567890| &&
          cl_abap_char_utilities=>cr_lf.
    
        r_result = repeat( val = file_line occ = 1000000 ).
      ENDMETHOD.
    
    
      METHOD get_file_mixed.
        DATA(file_line) =
          |1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890,"1234,{ cl_abap_char_utilities=>cr_lf }7890",1234567890,1234567890,1234567890| &&
          cl_abap_char_utilities=>cr_lf.
    
        r_result = repeat( val = file_line occ = 1000000 ).
      ENDMETHOD.
    
    
    
      METHOD run.
        DATA prepare_start TYPE timestampl.
        GET TIME STAMP FIELD prepare_start.
    
        TYPES:
          BEGIN OF t_file,
            name    TYPE string,
            content TYPE string,
          END OF t_file,
          t_files TYPE STANDARD TABLE OF t_file WITH EMPTY KEY.
        DATA(files) = VALUE t_files(
         ( name = `simple`  content = get_file_simple( )  )
         ( name = `long`    content = get_file_long( )    )
     ( name = `longer`  content = get_file_longer( )  )
         ( name = `complex` content = get_file_complex( ) )
         ( name = `mixed`   content = get_file_mixed( )   )
        ).
    
        DATA prepare_end TYPE timestampl.
        GET TIME STAMP FIELD prepare_end.
        WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = prepare_end tstmp2 = prepare_start ) }|. SKIP 2.
    
        WRITE: 'File', 15 'Line Parse', 30 'Find Parse', 45 'Match'. NEW-LINE.
        ULINE.
    
        LOOP AT files INTO DATA(file).
    
          WRITE file-name UNDER 'File'.
          DATA line_start TYPE timestampl.
          GET TIME STAMP FIELD line_start.
    
          DATA(line_parser) = NEW lcl_csv_parser_line(  ).
          DATA(line_result) = line_parser->parse( file-content ).
    
          DATA line_end TYPE timestampl.
          GET TIME STAMP FIELD line_end.
          WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = line_end tstmp2 = line_start ) }s| UNDER 'Line Parse'.
    
    
          DATA find_start TYPE timestampl.
          GET TIME STAMP FIELD find_start.
    
          DATA(find_parser) = NEW lcl_csv_parser_find(  ).
          DATA(find_result) = find_parser->parse( file-content ).
    
          DATA find_end TYPE timestampl.
          GET TIME STAMP FIELD find_end.
          WRITE |{ cl_abap_tstmp=>subtract( tstmp1 = find_end tstmp2 = find_start ) }s| UNDER 'Find Parse'.
    
          " WRITE COND #( WHEN line_result = find_result THEN 'yes' ELSE 'no') UNDER 'Match'.
          NEW-LINE.
        ENDLOOP.
      ENDMETHOD.
    
    
    
    ENDCLASS.
    
    START-OF-SELECTION.
      lcl_test_csv_parser=>run( ).
    

