How to implement a read-only-once select within SAP HANA?


Problem description

Context: I am a long-time MSSQL developer... What I would like to know is how to implement a read-only-once select from SAP HANA.

High-level pseudo-code:

  1. Collect requests via a db proc (query)
  2. Call an API with those requests
  3. Store the results of the requests (responses)

I have a table (A) that is the source of inputs to a process. Once a process has completed, it will write results to another table (B).

Perhaps this is all solved if I just add a column to table A, to prevent concurrent processors from selecting the same records from A?

I am wondering how to do this without adding the column to source table A.

What I have tried is a left outer join between tables A and B, to get rows from A that have no corresponding rows (yet) in B. This doesn't work, or at least I haven't implemented it such that rows are processed only once by any of the processors.

I have a stored proc to handle batch selection:

/*
 *      getBatch.sql
 *
 *      SYNOPSIS:  Retrieve the next set of criteria to be used in a search
 *                 request.  Use left outer join between input source table
 *                 and results table to determine the next set of inputs, and
 *                 provide support so that concurrent processes may call this
 *                 proc and get their inputs exclusively.
 */
alter procedure "ACOX"."getBatch" (
     in in_limit int
    ,in in_run_group_id varchar(36)
    ,out ot_result table (
         id bigint
        ,runGroupId varchar(36)
        ,sourceTableRefId integer
        ,name nvarchar(22)
        ,location nvarchar(13)
        ,regionCode nvarchar(3)
        ,countryCode nvarchar(3)
    )
) language sqlscript sql security definer as
begin       

    -- insert new records:
    insert into "ACOX"."search_result_v4" (
         "RUN_GROUP_ID"
        ,"BEGIN_DATE_TS"
        ,"SOURCE_TABLE"
        ,"SOURCE_TABLE_REFID"   
    )
    select
         :in_run_group_id as "RUN_GROUP_ID"  -- SQLScript: read parameters with a colon prefix
        ,CURRENT_TIMESTAMP as "BEGIN_DATE_TS"
        ,'acox.searchCriteria' as "SOURCE_TABLE"
        ,fp.descriptor_id as "SOURCE_TABLE_REFID"
    from 
        acox.searchCriteria fp
    left join "ACOX"."us_state_codes" st
        on trim(fp.region) = trim(st.usps)
    left outer join "ACOX"."search_result_v4" r
        on fp.descriptor_id = r.source_table_refid
    where
        st.usps is not null
        and r.BEGIN_DATE_TS is null
    limit :in_limit;
    
    -- select records inserted for return:
    ot_result =
    select
         r.ID id
        ,r.RUN_GROUP_ID runGroupId
        ,fp.descriptor_id sourceTableRefId
        ,fp.merch_name name
        ,fp.Location location
        ,st.usps regionCode
        ,'USA' countryCode
    from 
        acox.searchCriteria fp
    left join "ACOX"."us_state_codes" st
        on trim(fp.region) = trim(st.usps)
    inner join "ACOX"."search_result_v4" r
        on fp.descriptor_id = r.source_table_refid
        and r.COMPLETE_DATE_TS is null
        and r.RUN_GROUP_ID = :in_run_group_id
    where
        st.usps is not null
    limit :in_limit;

end;
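
For reference, a call of this procedure might look like the following; the batch size and run-group UUID are illustrative values, and the trailing ? binds the tabular OUT parameter:

-- illustrative invocation: one worker fetches a batch of 100 rows
call "ACOX"."getBatch"(100, '0f8fad5b-d9cb-469f-a165-70867728950e', ?);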

When running 7 concurrent processors, I get a 35% overlap. That is to say, out of 5,000 input rows the resulting row count is 6,755. Running time is about 7 minutes.

Currently my solution includes adding a column to the source table. I wanted to avoid that, but it seems to make for a simpler implementation. I will update the code shortly; it includes an update statement prior to the insert.
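
The updated code is not shown yet; purely as an illustration, that claim-by-update idea could be sketched as below, assuming a nullable RUN_GROUP_ID claim column has been added to the source table (the column and names here are hypothetical, not the poster's actual code):

-- sketch: claim up to :in_limit unclaimed rows for this run group first,
-- so that concurrent workers cannot select the same records afterwards
update acox.searchCriteria
set run_group_id = :in_run_group_id
where descriptor_id in (
      select descriptor_id
      from acox.searchCriteria
      where run_group_id is null
      limit :in_limit );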

Helpful references:

  • SAP HANA Concurrency Control
  • Exactly-Once Semantics Are Possible: Here’s How Kafka Does It

Recommended answer

First off: there is no "read-only-once" in any RDBMS, including MS SQL. Literally, it would mean that a given record can only be read once and would then "disappear" for all subsequent reads. (That is effectively what a queue does, or the well-known special case of a queue: the pipe.)

I assume that is not what you are looking for.

Instead, I believe you want to implement processing semantics analogous to "once-and-only-once", aka "exactly-once", message delivery. While this is impossible to achieve in potentially partitioned networks, it is possible within the transaction context of databases.

This is a common requirement, e.g. with batch data loading jobs that should only load data that has not been loaded so far (i.e. the new data that was created after the last batch load job began).

Sorry for the long preamble, but any solution here depends on being clear about what we actually want to achieve. I will get to an approach for that now.

The major RDBMS vendors figured out long ago that blocking readers is generally a terrible idea if the goal is high transaction throughput. Consequently, HANA does not block readers - ever (ok, not ever-ever, but in the normal operation setup). The main issue with the "exactly-once" processing requirement is really not the reading of the records, but the possibility of processing them more than once or not at all.

Both of these potential issues can be addressed with the following approach:

  1. SELECT ... FOR UPDATE ... the records that should be processed (based on e.g. unprocessed records, up to N records, even/odd IDs, zip code, ...). With this, the current session has an UPDATE TRANSACTION context and exclusive locks on the selected records. Other transactions can still read those records, but no other transaction can lock them - neither for UPDATE, DELETE, nor for another SELECT ... FOR UPDATE (see the combined sketch after step 4).

  2. Now you do your processing - whatever this involves: merging, inserting, updating other tables, writing log entries...

  3. As the final step of the processing, "mark" the records as processed. How exactly this is implemented does not really matter. One could add a processed column to the table and set it to TRUE when records have been processed, or one could have a separate table that contains the primary keys of the processed records (and maybe a load-job-id to keep track of multiple load jobs). In whatever way this is implemented, this is the point in time where the processed status needs to be captured; the sketch after step 4 uses the separate-table variant.

  4. COMMIT or ROLLBACK (in case something went wrong). The COMMIT persists the records written to the target table and the processed-status information, and it releases the exclusive locks on the source table.
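
As a concrete illustration of steps 1-4, here is a minimal sketch of one worker's transaction (autocommit off). The tracking table "ACOX"."processed_records" and the run-group value 'run-42' are assumptions for the sketch, not part of the original question; the deterministic ORDER BY is what lets each statement address the same batch within the transaction:

-- one-time setup (illustrative): tracking table for processed records
create table "ACOX"."processed_records" (
     source_table_refid integer primary key
    ,run_group_id       varchar(36)
    ,processed_ts       timestamp
);

-- Step 1: select-and-lock the next batch of unprocessed rows;
-- readers are not blocked, but other lockers wait until we commit.
select fp.descriptor_id
from acox.searchCriteria fp
where fp.descriptor_id not in (
      select source_table_refid from "ACOX"."processed_records")
order by fp.descriptor_id
limit 100
for update;

-- Step 2: the processing itself - here, writing the result rows.
insert into "ACOX"."search_result_v4"
    ("RUN_GROUP_ID", "BEGIN_DATE_TS", "SOURCE_TABLE", "SOURCE_TABLE_REFID")
select 'run-42', current_timestamp, 'acox.searchCriteria', fp.descriptor_id
from acox.searchCriteria fp
where fp.descriptor_id not in (
      select source_table_refid from "ACOX"."processed_records")
order by fp.descriptor_id
limit 100;

-- Step 3: capture the processed status in the same transaction.
insert into "ACOX"."processed_records"
select fp.descriptor_id, 'run-42', current_timestamp
from acox.searchCriteria fp
where fp.descriptor_id not in (
      select source_table_refid from "ACOX"."processed_records")
order by fp.descriptor_id
limit 100;

-- Step 4: COMMIT makes steps 2 and 3 atomic and releases the locks.
commit;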

As you see, Step 1 takes care of the issue that records may be missed, by selecting all wanted records that can be processed (i.e. they are not exclusively locked by any other process). Step 3 takes care of the issue that records might be processed more than once, by keeping track of the processed records. Obviously, this tracking has to be checked in Step 1 - the two steps are interconnected, which is why I point them out explicitly. Finally, all the processing occurs within the same DB transaction context, allowing for a guaranteed COMMIT or ROLLBACK across the whole transaction. That means no "record marker" will ever be lost once the processing of the records has been committed.

Now, why is this approach preferable to making records "un-readable"? Because of the other processes in the system.

Maybe the source records are still read by the transaction system but never updated. This transaction system should not have to wait for the data load to finish.

Or maybe somebody wants to do some analytics on the source data and also needs to read those records.

Or maybe you want to parallelise the data loading: it's easily possible to skip locked records and only work on the ones that are "available for update" right now. See e.g. Load balancing SQL reads while batch-processing? for that.

Ok, I guess you were hoping for something easier to consume; alas, that's my approach to this sort of requirement as I understood it.
