如何提高 BigQuery 中 GeoIP 查询的性能? [英] How to improve performance of GeoIP query in BigQuery?

查看:27
本文介绍了如何提高 BigQuery 中 GeoIP 查询的性能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经在 BigQuery 中加载了我的应用程序日志,我需要根据这些日志中的 IP 地址计算国家/地区.

I have loaded my application logs in BigQuery and I need to calculate country based on IP address from those logs.

我在我的表和我从 MaxMind 下载的 GeoIP 映射表之间编写了一个连接查询.

一个理想的查询是带有范围过滤器的 OUTER JOIN,但是 BQ 在连接条件中只支持 =.因此,查询执行 INNER JOIN 并处理 JOIN 每一侧的缺失值.

An ideal query would be OUTER JOIN with range filter, however BQ supports only = in join conditions. So the query does an INNER JOIN and handles missing values in each side of the JOIN.

我修改了我的原始查询,以便它可以在维基百科公共数据集上运行.

I have amended my original query so it could run on the Wikipedia public data set.

有人可以帮我让它运行得更快吗?

Can someone please help me make this run faster?

SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name

FROM
    (SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
    FROM [publicdata:samples.wikipedia] Limit 1000) AS A1

JOIN 
    (SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
    FROM

        -- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
        -- all Ranges of valid IPs:
        (SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])

        -- Missing rages lower from From_IP 
        ,(SELECT
            PriorRangeEndIP + 1 From_IP_Code, 
            From_IP_Code - 1 AS To_IP_Code, 
            'NA' AS Country_Name
        FROM

            -- use of LAG function to find prior valid range
            (SELECT 
                From_IP_Code, 
                To_IP_Code, Country_Name, 
                LAG(To_IP_Code, 1, INTEGER(0)) 
                OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP                 
            FROM [QA_DATASET.GeoIP]) A

            -- If gap from prior valid range is > 1 than its a gap to fill
            WHERE From_IP_Code > PriorRangeEndIP + 1)

        -- Missing rages higher tan Max To_IP
        ,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
        FROM [QA_DATASET.GeoIP])
    ) AS B
ON A1.ONE = B.ONE    -- fake join condition to overcome allowed use of only = in joins

-- Join condition where valid IP exists on left
WHERE
    A1.client_ip_code >= B.From_IP_Code
    AND A1.client_ip_code <= B.To_IP_Code
    OR (A1.client_ip_code IS NULL 
    AND B.From_IP_Code = 1)    -- where there is no valid IP on left contributor_ip

推荐答案

2019,大大改进的答案:

#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0')  ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT null  
  GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
  SELECT ip, country_name, c
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`  
  USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC

<小时>

此答案的清理版本:http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html

让我整理一下原始查询:

Let me tidy the original query:

SELECT
  id,
  client_ip,
  client_ip_code,
  B.Country_Name AS Country_Name
FROM (
  SELECT
    id,
    contributor_ip AS  client_ip,
    INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
    1 AS One
  FROM
    [publicdata:samples.wikipedia]
  WHERE contributor_ip IS NOT NULL
  LIMIT
    1000
    ) AS A1
LEFT JOIN
  (
  SELECT
    From_IP_Code,
    To_IP_Code,
    Country_Name,
    1 AS One
  FROM
    --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
    (
    SELECT
      From_IP_Code,
      To_IP_Code,
      Country_Name
    FROM
      [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
    ,
    (
    SELECT
      PriorRangeEndIP+1 From_IP_Code,
      From_IP_Code-1 AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
    from(
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name
        ,
        LAG(To_IP_Code,
          1,
          INTEGER(0)) OVER(
        ORDER BY
          From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
      FROM
        [playscape-proj:GeoIP.GeoIP])A
    WHERE
     From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
      ,
    (
    SELECT
      MAX(To_IP_Code)+1 AS From_IP_Code,
      INTEGER (4311810304) AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages higher tan Max To_IP
    FROM
      [playscape-proj:GeoIP.GeoIP])
    ) AS B
  ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
  A1.client_ip_code>=B.From_IP_Code
  AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
  OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;

这是一个很长的查询!(和一个非常有趣的).它在 14 秒内运行.我们如何优化它?

That's a long query! (and a very interesting one). It runs in 14 seconds. How can we optimize it?

我发现的一些技巧:

  • 跳过 NULL.如果日志中没有ip地址,请不要尝试匹配.
  • 减少组合.不是将每个左侧记录与每个右侧记录都连接起来,而是将左侧的 39.x.x.x 记录与右侧的 39.x.x.x 记录连接起来如何.只有少数(3 或 4)条规则涵盖多个范围.很容易在 geolite 表上添加几条规则来添加规则来弥补这些差距.

所以我正在改变:

  • 1 AS OneINTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One(两次).
  • 添加WHEREcontributor_ip IS NOT NULL".

现在它在 3 秒内运行!5% 的 ip 无法定位,可能是由于描述的差距(容易修复).

And now it runs in 3 seconds! 5% of the ips could not be geolocated, probably by the described gaps (easy fix).

现在,从 LIMIT 1000 到 LIMIT 300000 怎么样.需要多长时间?

Now, how about going from the LIMIT 1000 to LIMIT 300000. How long will it take?

37 秒!比描述的 25 分钟好多了.如果你想更上一层楼,我建议把右边的桌子变成静态的——因为一旦计算出来它根本不会改变,它只是基本规则的扩展.然后你可以使用 JOIN EACH.

37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest turning the right side table into a static one - as once computed it doesn't change at all, it's just an expansion of the basic rules. Then you can use JOIN EACH.

SELECT
  id,
  client_ip,
  client_ip_code,
  B.Country_Name AS Country_Name
FROM (
  SELECT
    id,
    contributor_ip AS  client_ip,
    INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
    INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
  FROM
    [publicdata:samples.wikipedia]
  WHERE contributor_ip IS NOT NULL
  LIMIT
    300000
    ) AS A1
JOIN 
  (
  SELECT
    From_IP_Code,
    To_IP_Code,
    Country_Name,
    INTEGER(From_IP_Code/(256*256*256)) AS One
  FROM
    --3 IP sets: 1.valid ranges,  2.Gaps,  3. Gap at the END of the set
    (
    SELECT
      From_IP_Code,
      To_IP_Code,
      Country_Name
    FROM
      [playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
    ,
    (
    SELECT
      PriorRangeEndIP+1 From_IP_Code,
      From_IP_Code-1 AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages lower    FROM      From_IP
    from(
      SELECT
        From_IP_Code,
        To_IP_Code,
        Country_Name
        ,
        LAG(To_IP_Code,
          1,
          INTEGER(0)) OVER(
        ORDER BY
          From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
      FROM
        [playscape-proj:GeoIP.GeoIP])A
    WHERE
     From_IP_Code>PriorRangeEndIP+1) -- If gap  FROM  prior valid range IS >1 than its a gap to fill
      ,
    (
    SELECT
      MAX(To_IP_Code)+1 AS From_IP_Code,
      INTEGER (4311810304) AS To_IP_Code,
      'NA' AS Country_Name -- Missing rages higher tan Max To_IP
    FROM
      [playscape-proj:GeoIP.GeoIP])
    ) AS B
  ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
  A1.client_ip_code>=B.From_IP_Code
  AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
  OR (A1.client_ip_code IS NULL
    AND B.From_IP_Code=1 ) -- WHERE  there IS no valid IP ON left contributor_ip;

这篇关于如何提高 BigQuery 中 GeoIP 查询的性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆