Redshift - 简化查询计划 [英] Redshift - Simplify Query Plan

查看:16
本文介绍了Redshift - 简化查询计划的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 Redshift 中有两个表,我试图将它们连接起来,以根据用户规范化的 IP 地址获取邮政编码人口统计数据.通过规范化地址,我的意思是它与一个统一长度的字符串一致,该字符串去掉了句点并且可以直接相互比较.例如,这在任何连接完成之前应用于所有 ip 并存储在表中:

I have two tables in Redshift that I am trying to do a join on to get zip code demographics based on a users normalized ip address. By normalized address, I mean that it is concerted to a uniform length string that has the periods stripped out and can be directly compared to one another. For example, this is applied to all ips before any joins are done and is stored in the tables:

lpad(split_part(ip, '.', 1), 3, '0') ||
lpad(split_part(ip, '.', 2), 3, '0') ||
lpad(split_part(ip, '.', 3), 3, '0') ||
lpad(split_part(ip, '.', 4), 3, '0')

所以 209.170.151.71 将转换为 209170151071.

我有两张桌子.第一个是visitor_details,其中包含以下内容:

I have two tables. This first is visitor_details which contains the following:

-----------------------------
| visitor_id |      ip      |
-----------------------------
|      1     | 209170151071 |
|      2     | 123170167071 |
      ...           ...
| 50000000   | 001213020341 |
-----------------------------

我有一个名为 geo_ip 的表,它具有以下结构:

The I have a table called geo_ip which has the following structure:

----------------------------------------
|    start_ip |    end_ip      |  zip  |
----------------------------------------
|209170151071 | 209170151071   | 11101 |
|309170151071 | 409170151071   | 11102 |
      ...           ...           ...
|509170151071 | 609170151071   | 11103 |
----------------------------------------

我正在尝试运行以下查询:

I'm trying to run the following query:

WITH vd AS (
  SELECT visitor_id,
         ip_address as c_ip
  FROM dev.visitor_details
)
SELECT
  visitor_id,
  c_ip,
  g.*
FROM
  vd
JOIN
  dev.geo_ip g
  ON vd.c_ip BETWEEN g.startip and g.endip
LIMIT 500;

geo ip 上的排序键是使用 startip 和 endip 的交错排序键.桌子似乎也没有倾斜.但是,运行查询会导致很长的执行时间(从未完成).查看解释,我看到以下内容:

The sort keys on geo ip are an interleaved sort key using both startip and endip. The table also doesn't seem to be skewed. However, running the query results in a very long execution time (never completed). Looking at the explain I see the following:

XN Limit  (cost=0.00..245.17 rows=500 width=238)
   ->  XN Nested Loop DS_BCAST_INNER  (cost=0.00..18442148764959.20 rows=37610983146614 width=238)
         Join Filter: ((("inner".startip)::text <= ("outer".ip_address)::text) AND (("inner".endip)::text >= ("outer".ip_address)::text))
         ->  XN Seq Scan on visitor_details  (cost=0.00..596971.20 rows=59697120 width=72)
         ->  XN Seq Scan on geo_ip g  (cost=0.00..56702.71 rows=5670271 width=166)
 ----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----

更奇怪的是,如果我为连接硬编码一个 IP 地址,查询计划看起来很正常.

What is odder, is that if I hard code an ip address for the join the query plan looks normal.

谁能就如何优化表设置的查询以使其有效运行提出任何建议?

Can anyone make any suggestions on how to either optimize the query of the table setup to make it run efficiently?

更新

我做了第一个响应建议的更改,但我仍然看到嵌套循环.所有 IP 现在都是 bigint,并且删除了 with 语句.

I made the changes suggested by the first response, but I am still seeing the nested loop. All IPs are now bigints and the with statement was removed.

explain SELECT 
    vd.visitor_id,
    vd.ip_address,
    gi.zip
FROM
dev.visitor_details2 vd
JOIN dev.geo_ip3 gi ON vd.ip BETWEEN gi.startip and gi.endip
LIMIT 500;


                                               QUERY PLAN                                                
---------------------------------------------------------------------------------------------------------
 XN Limit  (cost=0.00..136.62 rows=500 width=51)
   ->  XN Nested Loop DS_BCAST_INNER  (cost=0.00..10276958524959.20 rows=37610983146614 width=51)
         Join Filter: (("inner".startip <= "outer".ip) AND ("inner".endip >= "outer".ip))
         ->  XN Seq Scan on visitor_details2 vd  (cost=0.00..596971.20 rows=59697120 width=52)
         ->  XN Seq Scan on geo_ip3 gi  (cost=0.00..56702.71 rows=5670271 width=23)
 ----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----
(6 rows)

更新 2以下是用于确认它们都是 bigint 的表定义:

UPDATE 2 Here are the table definitions to confirm that they are both bigint:

master=# \d dev.visitor_details2;
          Table "dev.visitor_details2"
   Column   |          Type          | Modifiers 
------------+------------------------+-----------
 id         | integer                | not null
 visitor_id | character varying(108) | 
 ip         | bigint                 | 
 ip_address | character varying(192) | 
 domain     | integer                | 
Indexes:
    "visitor_details2_pkey" PRIMARY KEY, btree (id)

master=# \d dev.geo_ip3;
                Table "dev.geo_ip3"
    Column    |          Type          | Modifiers 
--------------+------------------------+-----------
 startip      | bigint                 | 
 endip        | bigint                 | 
 country      | character varying(16)  | 
 region       | character varying(32)  | 
 city         | character varying(32)  | 
 zip          | character varying(16)  | 
 latitude     | double precision       | 
 longitude    | double precision       | 
 areacode     | integer                | 
 metrocode    | integer                | 
 timezone     | character varying(32)  | 
 isp          | character varying(128) | 
 organization | character varying(128) | 
 netspeed     | character varying(32)  | 
 domain       | character varying(128) | 

推荐答案

我不知道你为什么在这里使用 with 语句.你读过文档吗?我猜正在发生的事情是,它正在为visitor_detail 表中的每个条目执行with 块中的查询,然后必须将其广播到另一个节点XN Nested Loop DS_BCAST_INNER.您还可以看到它加入了文本 Join Filter: ((("inner".startip)::text.您应该考虑将 ip_address 的数据类型更改为 BIGINT.

I'm not sure why you're using the with statement here. Did you read the documentation? I guess what's happening is, that it's executing the query in the with block for every entry in the visitor_detail table and then has to broadcast this to the other node XN Nested Loop DS_BCAST_INNER. Also you can see it's joining on text Join Filter: ((("inner".startip)::text. You should consider changing the data type of the ip_address to BIGINT.

我会这样写查询:

SELECT 
    vd.visitor_id,
    vd.ip_address,
    gi.zip
FROM
dev.visitor_details vd
JOIN geo_ip gi ON vd.ip_address BETWEEN gi.start_ip and gi.end_ip
LIMIT 500;

更新

看起来 Redshift 处理程序在之间"加入的方式非常昂贵.您是否考虑过显式添加这些范围内的所有 ip 地址并使用 ip_address 作为排序键?我知道这个表的行数可能会变得非常大,但是如果您使用适当的压缩(对于 ip 使用 DELTA32K,对于 zip 使用运行长度)并分配到所有节点,这可能是一个解决方案.

It looks like the way Redshift handels joins on 'between' are pretty expensive. Did you think about adding all ip addresses in these range explicit and use the ip_address as the sort key? I know this table can become quite large in number of rows, but if you use proper compression (DELTA32K for ip and runlength for zip) and a distribution to all nodes, this might be a solution.

这篇关于Redshift - 简化查询计划的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆