How to use a value from the outer table in my subqueries in Redshift?

Question

I have the query shown below. It defines three CTEs (dates, active, and featuretype), which I use in the main outer SELECT.

WITH dates AS (
  SELECT 
    (
      DATE_TRUNC(
        'week', 
        getdate () + INTERVAL '1 day'
      ):: DATE - 7 *(
        ROW_NUMBER() OVER (
          ORDER BY 
            TRUE
        ) -1
      ) - INTERVAL '1 day'
    ):: DATE AS week_of 
  FROM 
    (
      SELECT 
        1 AS X 
      UNION ALL 
      SELECT 
        1 AS X 
      UNION ALL 
      SELECT 
        1 AS X 
      UNION ALL 
      SELECT 
        1 AS X 
      UNION ALL 
      SELECT 
        1 AS X 
      UNION ALL 
      SELECT 
        1 AS X 
      UNION ALL 
      SELECT 
        1 AS X 
      UNION ALL 
      SELECT 
        1 AS X
    )
), 
active as (
  select 
    client_id, 
    update_timestamp, 
    version, 
    status 
  from 
    kites.customer c1 
  where 
    (
      c1.is_customer = false 
      or c1.is_customer is NULL
    ) 
    AND c1.status IN ('TEST1', 'TEST2', 'TEST3') 
    and c1.mgt_name = 'ABC'
), 
featuretype AS (
  SELECT 
    * 
  FROM 
    kites.program 
  WHERE 
    client_type = 'compliance-data' 
    AND client_status = 'TEST1'
) 
SELECT 
  dates.week_of, 
  DATE_PART(d, dates.week_of) AS day, 
  count(DISTINCT active.client_id) as total_count, 
  COUNT(
    CASE WHEN active.status = 'TEST1' THEN active.client_id END
  ) AS count1 
FROM 
  active 
  join dates on active.update_timestamp <= dates.week_of 
  LEFT JOIN featuretype p1 ON active.client_id = p1.client_id
-- how can I rewrite this where subquery by avoiding correlated subquery error?
WHERE 
  active.version = (
    Select 
      MAX(version) 
    from 
      kites.customer c2 
    where 
      active.client_id = c2.client_id 
      and c2.update_timestamp <= dates.week_of
  ) 
  AND p1.client_version = (
    Select 
      MAX(client_version) 
    from 
      kites.program p2 
    where 
      p1.client_id = p2.client_id 
      AND p1.client_type = p2.client_type 
      AND p2.update_timestamp <= dates.week_of
  ) 
GROUP BY 
  week_of 
ORDER by 
  week_of DESC 
limit 
  7;

I am trying to figure out how to rewrite the two subqueries in the WHERE clause at the end, using a JOIN or any other approach. When I run the query as is, it fails with the error below. I tried converting those WHERE-clause subqueries to JOINs, but I still get the same error.

Invalid operation: This type of correlated subquery pattern is not supported due to internal error

Since I need to use values from the outer tables' (CTEs') columns in my subqueries, is there any way to use a lateral join in Redshift? From what I have read, it looks like I can't. Is there another way to do this?

Recommended answer

A correlated sub-query is one whose SELECT has to be re-evaluated for every row of the outer table. On the very large tables typically stored in Redshift, this means a massive amount of re-evaluation and run time. Redshift is also a networked cluster, so much of that re-evaluation would be pushed across the network, and you can see how these query structures create problems.

Your query has two such correlated sub-queries and is a bit too large to reason about comfortably, so let me cut it down to show how you can attack the problem. I'll use this reduced version to show how it can be done.

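-- Note: the dates CTE from the question is omitted from this and the following reduced queries for brevity.
WITH active AS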
(
       SELECT client_id,
              update_timestamp,
              version,
              status
       FROM   kites.customer c1
       WHERE  (
                     c1.is_customer = FALSE
              OR     c1.is_customer IS NULL )
       AND    c1.status IN ('TEST1',
                            'TEST2',
                            'TEST3')
       AND    c1.mgt_name = 'ABC' )
SELECT   dates.week_of,
         Date_part(d, dates.week_of)      AS day,
         Count(DISTINCT active.client_id) AS total_count
from     active
join     dates
ON       active.update_timestamp <= dates.week_of
WHERE    active.version =
         (
                SELECT max(version)
                FROM   kites.customer c2
                WHERE  active.client_id = c2.client_id
                AND    c2.update_timestamp <= dates.week_of )
GROUP BY week_of
ORDER BY week_of DESC limit 7;

Now we have just one correlated sub-query. "active" is based on the table customer, and so is the SELECT in the WHERE clause. That isn't a problem in itself, but the WHERE clause wants the MAX(version) for the current row's client_id and for a range of dates, so it cannot resolve to a single set of data and has to be recomputed for every row in active.

The fix is to create a set of data that holds all the possible MAX(version) values and to join this set in with ON active.version = the_computed_max_version_for_this_client_id_and_date_range. We just need to compute this set. Since this is a MAX() over a range of dates for each client_id, it calls for a MAX() window function:

MAX(version) OVER (PARTITION by client_id ORDER BY dates.week_of ROWS UNBOUNDED PRECEDING) as max_version

Putting these together gives a first attempt:

WITH active AS
(
       SELECT client_id,
              update_timestamp,
              version,
              status
       FROM   kites.customer c1
       WHERE  (
                     c1.is_customer = FALSE
              OR     c1.is_customer IS NULL )
       AND    c1.status IN ('TEST1',
                            'TEST2',
                            'TEST3')
       AND    c1.mgt_name = 'ABC' )
SELECT   dates.week_of,
         Date_part(d, dates.week_of)      AS day,
         Count(DISTINCT active.client_id) AS total_count
FROM     active
join     dates
ON       active.update_timestamp <= dates.week_of
join
         (
                  SELECT   Max(version) over (PARTITION BY client_id ORDER BY dates.week_of ROWS UNBOUNDED PRECEDING) AS max_version
                  FROM     kites.customer c2
                  join     dates
                  ON       c2.update_timestamp <= dates.week_of
                  WHERE    c2.update_timestamp <= dates.week_of ) maxv
ON       active.version = maxv.max_version
GROUP BY week_of
ORDER BY week_of DESC limit 7;

I cannot promise this is fully correct, as I don't have data to test with, but hopefully it gives you a start. If you need more specific help, you will likely need to provide a data setup and expected results, and I'm sure the community can get it working.

Adding a second version that may be what is needed (it depends on the data you have):

WITH active AS
(
       SELECT client_id,
              update_timestamp,
              version,
              status
       FROM   kites.customer c1
       WHERE  (
                     c1.is_customer = FALSE
              OR     c1.is_customer IS NULL )
       AND    c1.status IN ('TEST1',
                            'TEST2',
                            'TEST3')
       AND    c1.mgt_name = 'ABC' )
SELECT   dates.week_of,
         Date_part(d, dates.week_of)      AS day,
         Count(DISTINCT active.client_id) AS total_count
FROM     active
join     dates
ON       active.update_timestamp <= dates.week_of
join
         (
                  SELECT   Max(version) over (PARTITION BY client_id ORDER BY dates.week_of ROWS UNBOUNDED PRECEDING) AS max_version,
                      client_id, week_of         
                  FROM     kites.customer c2
                  join     dates
                  ON       c2.update_timestamp <= dates.week_of
                  WHERE    c2.update_timestamp <= dates.week_of ) maxv
ON       active.version = maxv.max_version
   AND   active.client_id = maxv.client_id
   AND   dates.week_of = maxv.week_of
GROUP BY dates.week_of
ORDER BY dates.week_of DESC limit 7;
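
Since none of the above was tested against real data, a small local check may help. The following is only a sketch in Python using the standard-library sqlite3 module, not something run on Redshift. The toy rows, the simplified customer and dates tables, and the use of a plain GROUP BY on (client_id, week_of) in the derived table (swapped in for the window function above) are all assumptions made for illustration. It compares the correlated form with a join-based form and checks that both return the same rows.

```python
import sqlite3

# Hypothetical toy data standing in for kites.customer:
# (client_id, version, update_timestamp as an ISO date string).
CUSTOMER_ROWS = [
    (1, 1, "2024-01-01"),
    (1, 2, "2024-01-10"),
    (2, 1, "2024-01-05"),
    (2, 2, "2024-01-20"),
]
# Hypothetical stand-in for the dates CTE.
WEEKS = [("2024-01-07",), ("2024-01-14",), ("2024-01-21",)]

# Original shape: the sub-query refers to the outer row (c.client_id, d.week_of).
CORRELATED_SQL = """
SELECT d.week_of, c.client_id, c.version
FROM customer c
JOIN dates d ON c.update_timestamp <= d.week_of
WHERE c.version = (
    SELECT MAX(c2.version)
    FROM customer c2
    WHERE c2.client_id = c.client_id
      AND c2.update_timestamp <= d.week_of
)
ORDER BY d.week_of, c.client_id
"""

# Rewritten shape: compute every (client_id, week_of) maximum once, then join.
DECORRELATED_SQL = """
SELECT d.week_of, c.client_id, c.version
FROM customer c
JOIN dates d ON c.update_timestamp <= d.week_of
JOIN (
    SELECT c2.client_id, d2.week_of, MAX(c2.version) AS max_version
    FROM customer c2
    JOIN dates d2 ON c2.update_timestamp <= d2.week_of
    GROUP BY c2.client_id, d2.week_of
) maxv
  ON maxv.client_id = c.client_id
 AND maxv.week_of = d.week_of
 AND maxv.max_version = c.version
ORDER BY d.week_of, c.client_id
"""


def run_queries():
    """Load the toy data into an in-memory database and run both queries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer "
        "(client_id INTEGER, version INTEGER, update_timestamp TEXT)"
    )
    conn.execute("CREATE TABLE dates (week_of TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", CUSTOMER_ROWS)
    conn.executemany("INSERT INTO dates VALUES (?)", WEEKS)
    correlated = conn.execute(CORRELATED_SQL).fetchall()
    decorrelated = conn.execute(DECORRELATED_SQL).fetchall()
    conn.close()
    return correlated, decorrelated


correlated_rows, decorrelated_rows = run_queries()
print(correlated_rows == decorrelated_rows)
```

If the two results match on data that resembles yours, the derived table could be moved into the Redshift query by reading from kites.customer joined to the dates CTE. The second sub-query on kites.program could probably be handled the same way, grouping by client_id, client_type and week_of.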
