How to iterate over string characters in ABAP in performant way?

Question

I would like to know if there are other (faster) ways to iterate over a string in ABAP.

The naive approach, iterating using substring access, is too slow on files of 100 MB:

DATA(lv_string) = |1234567890|.
DATA(lv_strlen) = strlen( lv_string ).
" note: sy-index starts at 1, so this visits offsets 1..strlen-1 and skips offset 0
DO ( lv_strlen - 1 ) TIMES.
  DATA(lv_current_symbol) = lv_string+sy-index(1).
ENDDO.

I achieved a 50% performance increase by assigning chunks of the string to a field of type c with maximum length and then assigning a field symbol to it, but it is only a 50% increase and the code looks ugly:

CLASS lcl_ IMPLEMENTATION.

  METHOD main.

    "prepare the file to parsed
    DATA(lv_file) = me->get_file(  ).

    DATA lv_chunk TYPE c LENGTH 262143.
    CONSTANTS lc_chunk_size TYPE int4 VALUE 262143.
    DATA(lv_strlen) = strlen( lv_file ).

    GET TIME STAMP FIELD DATA(lv_time_stamp).
    WRITE / lv_time_stamp.

    DATA(lv_times) = lv_strlen DIV lc_chunk_size.
    IF ( lv_strlen  MOD lc_chunk_size > 0 ).
      lv_times = lv_times + 1.
    ENDIF.


    DO lv_times TIMES.
      DATA(lv_offset) = lc_chunk_size * ( sy-index - 1 ).
      IF  sy-index   = lv_times.
        DATA(lv_length) = lv_strlen MOD lc_chunk_size.
      ELSE.
        lv_length = lc_chunk_size.
      ENDIF.
      lv_chunk = lv_file+lv_offset(lv_length).
      FIELD-SYMBOLS <char1> TYPE c.

      ASSIGN lv_chunk+0(1) TO <char1>.
      DATA(lv_actual_length) = lv_length - 1.
      DO lv_actual_length TIMES.
        ASSIGN lv_chunk+sy-index(1) TO <char1>.
      ENDDO.

    ENDDO.

    GET TIME STAMP FIELD lv_time_stamp.
    WRITE / lv_time_stamp.
    DATA(lv_naive_strlen) = ( lv_strlen - 1 ).
    DO lv_naive_strlen TIMES.
      DATA(lv_current_symbol) = lv_file+sy-index(1).
    ENDDO.
    GET TIME STAMP FIELD lv_time_stamp.
    WRITE / lv_time_stamp.
  ENDMETHOD.


  METHOD get_file.
    DATA lv_file_line TYPE string.
    DO 10 TIMES.
      lv_file_line = |1234567890,{ lv_file_line }|.
    ENDDO.
    lv_file_line = lv_file_line && |;|.

    DATA(lt_file_as_table) = VALUE string_table(
        FOR i = 1 THEN  i + 1 UNTIL  i = 1000000
            ( lv_file_line ) ).

    CONCATENATE LINES OF lt_file_as_table INTO r_result.


  ENDMETHOD.

ENDCLASS.

Does anyone have a better approach?

Update: there was a question why I need this. Basically, I need to parse a CSV file according to the RFC: https://datatracker.ietf.org/doc/html/rfc4180
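To make the RFC 4180 constraint concrete: a comma or newline inside a double-quoted field is data, not a separator, and `""` inside quotes is an escaped quote. A minimal sketch of such a field splitter in Java (matching the Java example below; the class and method names are my own, not from the RFC):

```java
import java.util.ArrayList;
import java.util.List;

public class Rfc4180Fields {
    // Split one CSV record into fields per RFC 4180: commas inside
    // a double-quoted field are data, and "" is an escaped quote.
    static List<String> split(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        field.append('"');   // "" -> one literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    field.append(c);         // commas/newlines kept as data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }
}
```

This is exactly the per-character state machine that is cheap in Java but, as discussed below, expensive in interpreted ABAP.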

Update: I have updated the code and checked it on my S4H developer edition. The chunked approach takes 39 seconds for me and the naive approach takes 70 seconds. I guess that is slightly less than a 50% improvement (but even with a 50% improvement, the code is quite ugly).

Update: just to show how much faster it could be, you can run the following Java class. On my machine the performance difference is... staggering.

public class Main {

    public static void main(String[] args) {
        String lv_file = get_file();

        var lt_char_array = lv_file.toCharArray();
        var lv_char_array_length = lt_char_array.length;
        var lv_counter = 0;
        for (char lv_char : lt_char_array) {
            if (lv_char == ';'){
                lv_counter++;
            }
        }
        System.out.println(lv_counter);
    }

    private static String get_file() {
        StringBuilder lv_file_line = new StringBuilder("1234567890,");
        lv_file_line.append(String.valueOf(lv_file_line).repeat(10));
        lv_file_line.append(";");
        return String.valueOf(lv_file_line).repeat(1000000);
    }
}

Answer

How to iterate over string characters in ABAP in performant way?

You don't. Iterating over strings character by character in higher-level programming languages will always be slower than what the engine can do at a lower level. Thus you can always achieve better performance by utilizing the capabilities of the engine (the ABAP kernel), such as the inbuilt parsers for XML and JSON or the regular-expression engine (especially with JIT compilation), and, if you really have something that cannot be covered by those, by using the inbuilt string functions (SPLIT, substring, find, and the like).

For parsing CSV, one could do something like this (and you'll probably find similar approaches for whatever string processing you're trying to do):

  METHOD parse_chunk_line.
    SPLIT i_file AT cl_abap_char_utilities=>newline INTO TABLE DATA(lines).

    LOOP AT lines ASSIGNING FIELD-SYMBOL(<line>).
      SPLIT <line> AT ',' INTO TABLE DATA(values).

      LOOP AT values ASSIGNING FIELD-SYMBOL(<value>).
        DATA(value_len) = strlen( <value> ) - 1.

        IF <value>+0(1) = '"'.
          " Value opened
        ENDIF.

        IF <value>+value_len(1) = '"'.
          " Value closed
        ENDIF.
      ENDLOOP.
    ENDLOOP.
  ENDMETHOD.

In my test environment this outperforms your "chunk approach" by a factor of 3 (running in 5 s, which I think is quite okay for processing 100 MB), though I think these tests are not giving you any meaningful results. Implement all three approaches in a real-world scenario, then run it on a "real" ABAP system if you want measurements one can draw conclusions from.

You noted:

but to accommodate the RFC, you first need to read character by character, since you do not know whether the newline is contained within escape characters or not

This is true, so I would simply keep a boolean flag for whether you are inside a literal (an opening " was found but no closing one yet), then "reconstruct" the actual value by re-adding the newlines and commas in the loop. That way, when a value contains no newlines or commas (I hope the opposite is a corner case), the code takes the "fast path" and you can simply store <value> into your resulting data structure.
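That split-then-repair idea can be sketched as follows (in Java for brevity, mirroring the question's example; all names are my own): split eagerly on the separator, count quotes per piece, and while the running total is odd, re-add the separator that the split consumed.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRejoinCsv {
    // Number of double quotes in a piece; an odd running total across
    // pieces means the last split happened inside a quoted literal.
    static int countQuotes(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') n++;
        }
        return n;
    }

    // Fast path: split on commas, then repair the (hopefully rare)
    // pieces that were cut inside a quoted value by re-adding the comma.
    static List<String> splitLine(String line) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        boolean havePending = false;
        int quotes = 0;
        for (String piece : line.split(",", -1)) {
            if (quotes % 2 == 0) {            // comma was a real separator
                if (havePending) out.add(pending.toString());
                pending.setLength(0);
                pending.append(piece);
                havePending = true;
            } else {                           // comma was inside a literal
                pending.append(',').append(piece);
            }
            quotes += countQuotes(piece);
        }
        out.add(pending.toString());
        return out;
    }
}
```

The same trick applies one level up for newlines split inside quoted values. Note the sketch leaves fields still wrapped in their quotes; stripping them and unescaping `""` is omitted for brevity.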

For what it's worth, here's my microbenchmark:

CLASS Z_CSV DEFINITION
  PUBLIC
  FINAL
  CREATE PUBLIC .

PUBLIC SECTION.
  CLASS-METHODS run.
  CLASS-METHODS get_file
   RETURNING VALUE(r_result) TYPE string.

  CLASS-METHODS parse_chunk
    IMPORTING i_file TYPE string.
  CLASS-METHODS parse_naive
    IMPORTING i_file TYPE string.
  CLASS-METHODS parse_chunk_line
    IMPORTING i_file TYPE string.

PROTECTED SECTION.
PRIVATE SECTION.
ENDCLASS.



CLASS Z_CSV IMPLEMENTATION.

  METHOD get_file.
    DATA lv_file_line TYPE string.
    DO 10 TIMES.
      lv_file_line = |1234567890,{ lv_file_line }|.
    ENDDO.
    lv_file_line = lv_file_line && |;|.

    DATA(lt_file_as_table) = VALUE string_table(
        FOR i = 1 THEN  i + 1 UNTIL  i = 1000000
            ( lv_file_line ) ).

    CONCATENATE LINES OF lt_file_as_table INTO r_result.
  ENDMETHOD.



  METHOD parse_chunk.
    DATA(value_count) = 0.

    DATA lv_chunk TYPE c LENGTH 262143.
    CONSTANTS lc_chunk_size TYPE int4 VALUE 262143.
    DATA(lv_strlen) = strlen( i_file ).


    DATA(lv_times) = lv_strlen DIV lc_chunk_size.
    IF ( lv_strlen  MOD lc_chunk_size > 0 ).
      lv_times = lv_times + 1.
    ENDIF.


    DO lv_times TIMES.
      DATA(lv_offset) = lc_chunk_size * ( sy-index - 1 ).
      IF  sy-index   = lv_times.
        DATA(lv_length) = lv_strlen MOD lc_chunk_size.
      ELSE.
        lv_length = lc_chunk_size.
      ENDIF.
      lv_chunk = i_file+lv_offset(lv_length).
      FIELD-SYMBOLS <char1> TYPE c.

      ASSIGN lv_chunk+0(1) TO <char1>.
      DATA(lv_actual_length) = lv_length - 1.
      DO lv_actual_length TIMES.
        ASSIGN lv_chunk+sy-index(1) TO <char1>.

        IF <char1> = ','.
          value_count = value_count + 1.
        ENDIF.
      ENDDO.

    ENDDO.


    WRITE |Chunk counted { value_count }|.
  ENDMETHOD.


  METHOD parse_chunk_line.
    DATA(value_count) = 0.

    SPLIT i_file AT cl_abap_char_utilities=>newline INTO TABLE DATA(lines).

    LOOP AT lines ASSIGNING FIELD-SYMBOL(<line>).
      SPLIT <line> AT ',' INTO TABLE DATA(values).

      LOOP AT values ASSIGNING FIELD-SYMBOL(<value>).
        value_count = value_count + 1.

        DATA(value_len) = strlen( <value> ) - 1.

        IF <value>+0(1) = '"'.

        ENDIF.

        IF <value>+value_len(1) = '"'.

        ENDIF.
      ENDLOOP.
    ENDLOOP.

    WRITE |Line chunked counted { value_count }|.
  ENDMETHOD.



  METHOD parse_naive.
    DATA(value_count) = 0.
    DATA(lv_strlen) = strlen( i_file ).
    DATA(lv_naive_strlen) = ( lv_strlen - 1 ).
    DO lv_naive_strlen TIMES.
      DATA(lv_current_symbol) = i_file+sy-index(1).
      IF lv_current_symbol = ','.
        value_count = value_count + 1.
      ENDIF.
    ENDDO.

    WRITE |Naive counted { value_count }|.
  ENDMETHOD.


  METHOD run.
    DATA prepare_start TYPE timestampl.
    GET TIME STAMP FIELD prepare_start.

    DATA(file) = get_file(  ).

    DATA prepare_end TYPE timestampl.
    GET TIME STAMP FIELD prepare_end.

    WRITE |Preparation took { cl_abap_tstmp=>subtract( tstmp1 = prepare_end tstmp2 = prepare_start ) }|.


    DATA naive_start TYPE timestampl.
    GET TIME STAMP FIELD naive_start.
    parse_naive( file ).
    DATA naive_end TYPE timestampl.
    GET TIME STAMP FIELD naive_end.

    WRITE |Naive run took { cl_abap_tstmp=>subtract( tstmp1 = naive_end tstmp2 = naive_start ) }|.


    DATA chunk_start TYPE timestampl.
    GET TIME STAMP FIELD chunk_start.
    parse_chunk( file ).

    DATA chunk_end TYPE timestampl.
    GET TIME STAMP FIELD chunk_end.
    WRITE |Chunk run took { cl_abap_tstmp=>subtract( tstmp1 = chunk_end tstmp2 = chunk_start ) }|.

    DATA line_start TYPE timestampl.
    GET TIME STAMP FIELD line_start.
    parse_chunk_line( file ).

    DATA line_end TYPE timestampl.
    GET TIME STAMP FIELD line_end.
    WRITE |Line Chunk run took { cl_abap_tstmp=>subtract( tstmp1 = line_end tstmp2 = line_start ) }|.

  ENDMETHOD.
ENDCLASS.
