Remove loop in Redshift


Problem description

I have a table that contains information about the different tables in my database. For each table in it, I want regular (daily) logging to be done. The problem is that I have implemented this using a for loop, and it takes a long time to execute in Redshift. Below are the queries I used to create my table, the logging table, and the procedure that uses the for loop. Please suggest an alternate method to implement this without a for loop. Structure of the table that contains info about the different tables:

CREATE TABLE public.info_schema_table
(
    info_schema_name character varying(200) ENCODE lzo,
    info_object_name character varying(200) ENCODE lzo,
    info_object_type character varying(200) ENCODE lzo,
    info_object_full_name character varying(400) ENCODE lzo
)
DISTSTYLE EVEN;

Here, the object name refers to the table name, object_type records whether the object is a view or a table, and object_full_name stores the name of the table concatenated with the schema, i.e. if the schema name is "test_schema" then the full name is "test_schema.table_name". Structure of the table in which the logging needs to be stored:

CREATE TABLE public.redshift_logging_table
(
    log_schema_name character varying(30) ENCODE lzo,
    log_object_name character varying(30) ENCODE lzo,
    log_object_type character varying(30) ENCODE lzo,
    log_refresh_date date ENCODE az64,
    log_refresh_count bigint ENCODE az64,
    log_total_count bigint ENCODE az64
)
DISTSTYLE EVEN;

Here, refresh_date stores the date of logging, refresh_count stores the number of records inserted into the table on that particular day, and total_count contains the total number of records in the table as of the logging date.

Just for clarification, here's one sample record of the info_schema_table:
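
(The record itself was shown as an image in the original post and is not reproduced here; a hypothetical row of the shape described above would look like this:)

-- Hypothetical sample row, consistent with the description given earlier
-- (the actual record in the original post was shown as a screenshot):
INSERT INTO public.info_schema_table VALUES
  ('test_schema', 'table_name', 'table', 'test_schema.table_name');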

Here's the procedure that I am using to fill the logging table:

CREATE OR REPLACE PROCEDURE public.REDSHIFT_LOGGING_PROCEDURE()
AS $$
DECLARE
  var_total_count   bigint;
  var_records_today bigint;
  my_row            record;
  -- Cursor over every object listed in the metadata table.
  my_cursor CURSOR FOR
    SELECT info_schema_name, info_object_name, info_object_type, info_object_full_name
    FROM INFO_SCHEMA_TABLE;
BEGIN
  OPEN my_cursor;
  LOOP
    FETCH my_cursor INTO my_row;
    EXIT WHEN NOT FOUND;

    -- Count the rows currently in the target table (dynamic SQL; one full scan per table).
    EXECUTE 'select count(1)::bigint from ' || my_row.INFO_OBJECT_FULL_NAME
      INTO var_total_count;

    -- Rows added today = today's total minus yesterday's logged total.
    -- COALESCE handles the first run, when no previous-day log row exists.
    var_records_today := var_total_count - COALESCE(
      (SELECT log_total_count
       FROM REDSHIFT_LOGGING_TABLE
       WHERE LOG_REFRESH_DATE = current_date - 1
         AND LOG_OBJECT_NAME = my_row.INFO_OBJECT_NAME), 0);

    INSERT INTO REDSHIFT_LOGGING_TABLE
      (LOG_SCHEMA_NAME, LOG_OBJECT_NAME, LOG_OBJECT_TYPE, LOG_REFRESH_DATE, LOG_REFRESH_COUNT, LOG_TOTAL_COUNT)
    VALUES
      (my_row.info_schema_name, my_row.info_object_name, my_row.info_object_type,
       current_date, var_records_today, var_total_count);
  END LOOP;
  CLOSE my_cursor;
END;
$$ LANGUAGE plpgsql
SECURITY INVOKER;

All the tables only ever grow in number of records day by day, so the number of records inserted on any particular day is always >= 0. The issue is that this procedure works, but only for a small number of records in info_schema_table; if I run it for around 1,000 records, the procedure doesn't complete on Redshift even in an hour.

Please suggest an alternate method to execute this without using a for loop. Thank you.

Answer

I'm loath to go into the problem proper because it touches upon such a large range of issues - I'd end up trying to explain a pyramid of knowledge, and that's problematic here; there's only so much you can reasonably explain in a post.

I will make a few superficial observations.

  1. LZO is obsolete. It encodes quickly, but decodes slowly. It has been superseded by ZSTD, which encodes more slowly than LZO, but decodes more quickly - which is normally what you want - and it compresses significantly more.

  2. The maximum length of a schema name or a table name is 127 characters, so length 200 is too long, and 30 is too short. Note, however, that you do need to use varchar: although in the system tables the type for names is char, Redshift writes multi-byte UTF-8 into those columns, so if you, say, UNLOAD them and then COPY them into duplicates of the original tables, the COPY will fail on the basis that you are loading UTF-8 into char (rolls eyes).

  3. You're using AZ64, which is interesting, because I'm certain you do not know what kind of data to use it on - and you know you don't know, because no one does; AWS have never said how it works. On the face of it, it appears to be a runlength-like encoder, although I've been told by a dev that it's really a delta-like encoder; I've yet to experiment further than my initial investigations, where it looked runlength-like. In any event, you have no sorting, so AZ64, be it runlength or delta, is absolutely an incorrect encoding choice for those columns.
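
Putting the first three observations together, the logging table might be declared more like the following - a sketch only, using ZSTD as a general-purpose default encoding for the unsorted columns:

-- Sketch: the logging table with the observations above applied.
CREATE TABLE public.redshift_logging_table
(
    -- 127 is the maximum length Redshift permits for schema/table names
    log_schema_name   character varying(127) ENCODE zstd,
    log_object_name   character varying(127) ENCODE zstd,
    log_object_type   character varying(30)  ENCODE zstd,
    -- ZSTD rather than AZ64, since the table has no sort key
    log_refresh_date  date                   ENCODE zstd,
    log_refresh_count bigint                 ENCODE zstd,
    log_total_count   bigint                 ENCODE zstd
)
DISTSTYLE EVEN;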

  4. The use of one-row INSERTs is catastrophic for Redshift.

I would strongly advise you, unless you really do have more than a terabyte of data, to use Postgres. I may be wrong, but I think you may not be well-informed about Redshift, which is in fact a highly knowledge-intensive, rather narrow use-case data warehouse. It is something to be used only when there are no alternatives, because of the considerable constraints and restrictions which must be honoured to use it correctly - and of course, if the user is not aware of those constraints and restrictions in the first place, it's not likely to go well. If you have little data, it just won't matter, since the hardware will overwhelm the data; but as you begin to have more data, you are, I think, very likely to find that Redshift's performance seems slow and erratic, that you need to add more and more hardware to deal with it, and that it then appears expensive. In fact it is only that Redshift is being operated incorrectly.

Redshift is capable of handling Big Data when, and only when, operated correctly - and even then, all you get is timely SQL on Big Data. You do not get lots of users, lots of small/fast inserts, lots of updates, and so on - things which, in my experience, people assume can be done.
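
As a pointer in the set-based direction the question asks about (this sketch is mine, not part of the original answer): Redshift keeps an estimated row count per table in the system view SVV_TABLE_INFO (column tbl_rows), so if an estimate is acceptable for logging purposes, the whole job can be a single INSERT ... SELECT, with no loop and no per-table COUNT(*):

-- Sketch: log every table in one set-based statement, using the
-- tbl_rows estimate from SVV_TABLE_INFO instead of scanning each table.
-- Caveats: tbl_rows is an estimate (kept current by ANALYZE / auto-analyze),
-- and SVV_TABLE_INFO lists tables only, so views will not be logged this way.
INSERT INTO redshift_logging_table
  (log_schema_name, log_object_name, log_object_type,
   log_refresh_date, log_refresh_count, log_total_count)
SELECT
  i.info_schema_name,
  i.info_object_name,
  i.info_object_type,
  current_date,
  t.tbl_rows::bigint - COALESCE(prev.log_total_count, 0),  -- rows added since yesterday
  t.tbl_rows::bigint
FROM info_schema_table AS i
JOIN svv_table_info AS t
  ON  t."schema" = i.info_schema_name
  AND t."table"  = i.info_object_name
LEFT JOIN redshift_logging_table AS prev
  ON  prev.log_object_name  = i.info_object_name
  AND prev.log_refresh_date = current_date - 1;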
