How to improve performance of GeoIP query in BigQuery?


Question


I have loaded my application logs into BigQuery, and I need to calculate the country based on the IP address in those logs.

I have written a join query between my table and a GeoIP mapping table that I downloaded from MaxMind.

An ideal query would be an OUTER JOIN with a range filter; however, BQ supports only = in join conditions. So the query does an INNER JOIN and handles missing values on each side of the JOIN.
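
To make the shape of that workaround concrete, here is a minimal Python sketch of the same idea: every IP is paired with every range (the constant join key), and a range filter keeps the matches. The helper names and the two sample ranges are made up for illustration, and `ip_to_code` only approximates what `INTEGER(PARSE_IP(...))` is used for in the query.

```python
import ipaddress


def ip_to_code(ip):
    """Return the IPv4 address as an integer, or None if it does not parse."""
    try:
        return int(ipaddress.IPv4Address(ip))
    except ValueError:  # AddressValueError is a subclass of ValueError
        return None


def cross_join_lookup(ips, ranges):
    """Pair every IP with every range, then keep pairs where the IP is in the range."""
    rows = []
    for ip in ips:
        code = ip_to_code(ip)
        for from_code, to_code, country in ranges:  # constant join key: all combinations
            # Range filter, the part that cannot be expressed in the ON clause.
            if code is not None and from_code <= code <= to_code:
                rows.append((ip, country))
    return rows


# Hypothetical sample ranges; "AA" and "BB" are placeholder country names.
SAMPLE_RANGES = [
    (ip_to_code("1.0.0.0"), ip_to_code("1.255.255.255"), "AA"),
    (ip_to_code("39.0.0.0"), ip_to_code("39.255.255.255"), "BB"),
]
```

The work grows with (number of IPs) × (number of ranges), which is why the query gets slow as the LIMIT increases.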

I have amended my original query so that it can run on the Wikipedia public data set.

Can someone please help me make this run faster?

    SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name
    
    FROM
        (SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
        FROM [publicdata:samples.wikipedia] Limit 1000) AS A1
    
    JOIN 
        (SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
        FROM
    
            -- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
            -- all Ranges of valid IPs:
            (SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])
    
            -- Missing ranges below From_IP
            ,(SELECT
                PriorRangeEndIP + 1 From_IP_Code, 
                From_IP_Code - 1 AS To_IP_Code, 
                'NA' AS Country_Name
            FROM
    
                -- use of LAG function to find prior valid range
                (SELECT 
                    From_IP_Code, 
                    To_IP_Code, Country_Name, 
                    LAG(To_IP_Code, 1, INTEGER(0)) 
                    OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP                 
                FROM [QA_DATASET.GeoIP]) A
    
                -- If the gap from the prior valid range is > 1, then it is a gap to fill
                WHERE From_IP_Code > PriorRangeEndIP + 1)
    
            -- Missing ranges higher than Max To_IP
            ,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
            FROM [QA_DATASET.GeoIP])
        ) AS B
    ON A1.ONE = B.ONE    -- fake join condition, since joins allow only =
    
    -- Join condition where valid IP exists on left
    WHERE
        A1.client_ip_code >= B.From_IP_Code
        AND A1.client_ip_code <= B.To_IP_Code
        OR (A1.client_ip_code IS NULL 
        AND B.From_IP_Code = 1)    -- where there is no valid IP on left contributor_ip
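
The three IP sets built by the subqueries above (valid ranges, gaps between them, and the gap after the last range) can be sketched in Python as follows. This is only an illustration of the gap-filling logic: it assumes the ranges do not overlap, and `max_code` defaults to the largest IPv4 value rather than the larger sentinel used in the query.

```python
def fill_gaps(ranges, max_code=2**32 - 1, filler="NA"):
    """Return the ranges plus filler ranges for every hole between 1 and max_code.

    ranges: iterable of (from_code, to_code, country) tuples, assumed non-overlapping.
    """
    filled = []
    prior_end = 0  # same default the query passes to LAG
    for from_code, to_code, country in sorted(ranges):
        if from_code > prior_end + 1:
            # Hole between the prior range and this one.
            filled.append((prior_end + 1, from_code - 1, filler))
        filled.append((from_code, to_code, country))
        prior_end = to_code
    if prior_end < max_code:
        # Hole after the last valid range.
        filled.append((prior_end + 1, max_code, filler))
    return filled
```

For example, `fill_gaps([(10, 20, "AA"), (30, 40, "BB")], max_code=50)` yields filler ranges for 1 to 9, 21 to 29, and 41 to 50 around the two valid ranges.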
    

Answer

A cleaned-up version of this answer is at: http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

Let me tidy the original query:

    SELECT
      id,
      client_ip,
      client_ip_code,
      B.Country_Name AS Country_Name
    FROM (
      SELECT
        id,
        contributor_ip AS  client_ip,
        INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
        1 AS One
      FROM
        [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      LIMIT
        1000
        ) AS A1
    LEFT JOIN
      (
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name,
        1 AS One
      FROM
        -- 3 IP sets: 1. valid ranges, 2. gaps, 3. gap at the end of the set
        (
        SELECT
          From_IP_Code,
          To_IP_Code,
          Country_Name
        FROM
          [playscape-proj:GeoIP.GeoIP]) -- all ranges of valid IPs
        ,
        (
        SELECT
          PriorRangeEndIP+1 From_IP_Code,
          From_IP_Code-1 AS To_IP_Code,
          'NA' AS Country_Name -- missing ranges below From_IP
        FROM (
          SELECT
            From_IP_Code,
            To_IP_Code,
            Country_Name,
            LAG(To_IP_Code,
              1,
              INTEGER(0)) OVER(
            ORDER BY
              From_IP_Code ASC) PriorRangeEndIP -- use the LAG function to find the prior valid range
          FROM
            [playscape-proj:GeoIP.GeoIP]) A
        WHERE
          From_IP_Code>PriorRangeEndIP+1) -- if the gap from the prior valid range is > 1, then it is a gap to fill
        ,
        (
        SELECT
          MAX(To_IP_Code)+1 AS From_IP_Code,
          INTEGER(4311810304) AS To_IP_Code,
          'NA' AS Country_Name -- missing ranges higher than max To_IP
        FROM
          [playscape-proj:GeoIP.GeoIP])
        ) AS B
      ON A1.ONE=B.ONE -- fake JOIN condition, since joins allow only =
    WHERE
      A1.client_ip_code>=B.From_IP_Code
      AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition where a valid IP exists on the left
      OR (A1.client_ip_code IS NULL
        AND B.From_IP_Code=1 ) -- where there is no valid IP in the left contributor_ip
    

That's a long query (and a very interesting one)! It runs in 14 seconds. How can we optimize it?

Some tricks I found:

• Skip NULLs. If there is no IP address in a log record, don't try to match it.
• Reduce the combinations. Instead of JOINing every left-side record with every right-side record, how about joining only the 39.x.x.x records on the left side with the 39.x.x.x records on the right side? Only a few (3 or 4) rules cover multiple ranges (that is, span more than one first octet), and it would be easy to add a couple of rules to the geolite table to cover these gaps.

So I'm changing:

• 1 AS One to INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One (twice; on the right side the key is computed from From_IP_Code instead).
• Adding a 'WHERE contributor_ip IS NOT NULL'.

And now it runs in 3 seconds! 5% of the IPs could not be geolocated, probably because of the described gaps (easy fix).
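
A minimal Python sketch of the bucketing trick, under the assumption that codes are plain integers and ranges are `(from_code, to_code, country)` tuples (the function name and sample values are hypothetical). Each IP is compared only with ranges whose From code shares its first octet, which also shows why a range spanning several first octets leaves some IPs unmatched:

```python
from collections import defaultdict

BUCKET = 256 * 256 * 256  # number of addresses per first octet, as in the query


def bucket_lookup(codes, ranges):
    """Match each IP code only against ranges whose From code has the same first octet."""
    by_bucket = defaultdict(list)
    for from_code, to_code, country in ranges:
        by_bucket[from_code // BUCKET].append((from_code, to_code, country))

    result = {}
    for code in codes:
        result[code] = None  # None means "could not be geolocated"
        for from_code, to_code, country in by_bucket.get(code // BUCKET, []):
            if from_code <= code <= to_code:
                result[code] = country
                break
    return result


# Placeholder data: the second range spans first octets 40 and 41.
ranges = [(39 * BUCKET, 39 * BUCKET + 999, "AA"), (40 * BUCKET, 42 * BUCKET - 1, "BB")]
matches = bucket_lookup([39 * BUCKET + 5, 40 * BUCKET + 7, 41 * BUCKET + 7], ranges)
```

Here the address in octet 41 is covered by the "BB" range but stays unmatched, because that range is filed only under octet 40, the bucket of its From code.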

Now, how about going from LIMIT 1000 to LIMIT 300000? How long will it take?

37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest turning the right-side table into a static one: once computed it doesn't change at all, since it's just an expansion of the basic rules. Then you can use JOIN EACH.

    SELECT
      id,
      client_ip,
      client_ip_code,
      B.Country_Name AS Country_Name
    FROM (
      SELECT
        id,
        contributor_ip AS  client_ip,
        INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
        INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
      FROM
        [publicdata:samples.wikipedia]
      WHERE contributor_ip IS NOT NULL
      LIMIT
        300000
        ) AS A1
    JOIN 
      (
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name,
        INTEGER(From_IP_Code/(256*256*256)) AS One
      FROM
        -- 3 IP sets: 1. valid ranges, 2. gaps, 3. gap at the end of the set
        (
        SELECT
          From_IP_Code,
          To_IP_Code,
          Country_Name
        FROM
          [playscape-proj:GeoIP.GeoIP]) -- all ranges of valid IPs
        ,
        (
        SELECT
          PriorRangeEndIP+1 From_IP_Code,
          From_IP_Code-1 AS To_IP_Code,
          'NA' AS Country_Name -- missing ranges below From_IP
        FROM (
          SELECT
            From_IP_Code,
            To_IP_Code,
            Country_Name,
            LAG(To_IP_Code,
              1,
              INTEGER(0)) OVER(
            ORDER BY
              From_IP_Code ASC) PriorRangeEndIP -- use the LAG function to find the prior valid range
          FROM
            [playscape-proj:GeoIP.GeoIP]) A
        WHERE
          From_IP_Code>PriorRangeEndIP+1) -- if the gap from the prior valid range is > 1, then it is a gap to fill
        ,
        (
        SELECT
          MAX(To_IP_Code)+1 AS From_IP_Code,
          INTEGER(4311810304) AS To_IP_Code,
          'NA' AS Country_Name -- missing ranges higher than max To_IP
        FROM
          [playscape-proj:GeoIP.GeoIP])
        ) AS B
      ON A1.ONE=B.ONE -- join on the first-octet bucket, since joins allow only =
    WHERE
      A1.client_ip_code>=B.From_IP_Code
      AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition where a valid IP exists on the left
      OR (A1.client_ip_code IS NULL
        AND B.From_IP_Code=1 ) -- where there is no valid IP in the left contributor_ip
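
One possible way to precompute the static right-side table mentioned above is to split every range at first-octet boundaries, so each bucket holds a rule for the addresses it contains. The Python sketch below is an assumption about how that expansion could be done, not the answer's stated method; the function name is hypothetical.

```python
BUCKET = 256 * 256 * 256  # number of addresses per first octet


def split_by_first_octet(ranges):
    """Split each (from_code, to_code, country) range so no piece crosses a first-octet boundary."""
    pieces = []
    for from_code, to_code, country in ranges:
        start = from_code
        while start <= to_code:
            bucket_end = (start // BUCKET + 1) * BUCKET - 1  # last address in this first octet
            end = min(to_code, bucket_end)
            pieces.append((start, end, country))
            start = end + 1
    return pieces
```

After such a split, a range that originally covered octets 40 and 41 becomes two rows, one per octet, so a join keyed on the first octet of From_IP_Code can find it from either bucket.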
    
