如何提高 BigQuery 中 GeoIP 查询的性能? [英] How to improve performance of GeoIP query in BigQuery?
问题描述
我已经在 BigQuery 中加载了我的应用程序日志,我需要根据这些日志中的 IP 地址计算国家/地区.
I have loaded my application logs in BigQuery and I need to calculate country based on IP address from those logs.
我在我的表和我从 MaxMind 下载的 GeoIP 映射表之间编写了一个连接查询.
一个理想的查询是带有范围过滤器的 OUTER JOIN
,但是 BQ
在连接条件中只支持 =
.因此,查询执行 INNER JOIN
并处理 JOIN
每一侧的缺失值.
An ideal query would be OUTER JOIN
with range filter, however BQ
supports only =
in join conditions.
So the query does an INNER JOIN
and handles missing values in each side of the JOIN
.
我修改了我的原始查询,以便它可以在维基百科公共数据集上运行.
I have amended my original query so it could run on the Wikipedia public data set.
有人可以帮我让它运行得更快吗?
Can someone please help me make this run faster?
SELECT id, client_ip, client_ip_code, B.Country_Name as Country_Name
FROM
(SELECT id, contributor_ip as client_ip, INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code, 1 AS One
FROM [publicdata:samples.wikipedia] Limit 1000) AS A1
JOIN
(SELECT From_IP_Code, To_IP_Code, Country_Name, 1 AS One
FROM
-- 3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the end of the set
-- all Ranges of valid IPs:
(SELECT From_IP_Code, To_IP_Code, Country_Name FROM [QA_DATASET.GeoIP])
-- Missing rages lower from From_IP
,(SELECT
PriorRangeEndIP + 1 From_IP_Code,
From_IP_Code - 1 AS To_IP_Code,
'NA' AS Country_Name
FROM
-- use of LAG function to find prior valid range
(SELECT
From_IP_Code,
To_IP_Code, Country_Name,
LAG(To_IP_Code, 1, INTEGER(0))
OVER(ORDER BY From_IP_Code asc) PriorRangeEndIP
FROM [QA_DATASET.GeoIP]) A
-- If gap from prior valid range is > 1 than its a gap to fill
WHERE From_IP_Code > PriorRangeEndIP + 1)
-- Missing rages higher tan Max To_IP
,(SELECT MAX(To_IP_Code) + 1 as From_IP_Code, INTEGER(4311810304) as To_IP_Code, 'NA' AS Country_Name
FROM [QA_DATASET.GeoIP])
) AS B
ON A1.ONE = B.ONE -- fake join condition to overcome allowed use of only = in joins
-- Join condition where valid IP exists on left
WHERE
A1.client_ip_code >= B.From_IP_Code
AND A1.client_ip_code <= B.To_IP_Code
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code = 1) -- where there is no valid IP on left contributor_ip
推荐答案
2019,大大改进的答案:
#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
FROM `publicdata.samples.wikipedia`
WHERE contributor_ip IS NOT null
GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
SELECT ip, country_name, c
FROM (
SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
)
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC
<小时>
此答案的清理版本:http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html
让我整理一下原始查询:
Let me tidy the original query:
SELECT
id,
client_ip,
client_ip_code,
B.Country_Name AS Country_Name
FROM (
SELECT
id,
contributor_ip AS client_ip,
INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
1 AS One
FROM
[publicdata:samples.wikipedia]
WHERE contributor_ip IS NOT NULL
LIMIT
1000
) AS A1
LEFT JOIN
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name,
1 AS One
FROM
--3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the END of the set
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
FROM
[playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
,
(
SELECT
PriorRangeEndIP+1 From_IP_Code,
From_IP_Code-1 AS To_IP_Code,
'NA' AS Country_Name -- Missing rages lower FROM From_IP
from(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
,
LAG(To_IP_Code,
1,
INTEGER(0)) OVER(
ORDER BY
From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
FROM
[playscape-proj:GeoIP.GeoIP])A
WHERE
From_IP_Code>PriorRangeEndIP+1) -- If gap FROM prior valid range IS >1 than its a gap to fill
,
(
SELECT
MAX(To_IP_Code)+1 AS From_IP_Code,
INTEGER (4311810304) AS To_IP_Code,
'NA' AS Country_Name -- Missing rages higher tan Max To_IP
FROM
[playscape-proj:GeoIP.GeoIP])
) AS B
ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
A1.client_ip_code>=B.From_IP_Code
AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code=1 ) -- WHERE there IS no valid IP ON left contributor_ip;
这是一个很长的查询!(和一个非常有趣的).它在 14 秒内运行.我们如何优化它?
That's a long query! (and a very interesting one). It runs in 14 seconds. How can we optimize it?
我发现的一些技巧:
- 跳过 NULL.如果日志中没有ip地址,请不要尝试匹配.
- 减少组合.不是将每个左侧记录与每个右侧记录都连接起来,而是将左侧的 39.x.x.x 记录与右侧的 39.x.x.x 记录连接起来如何.只有少数(3 或 4)条规则涵盖多个范围.很容易在 geolite 表上添加几条规则来添加规则来弥补这些差距.
所以我正在改变:
1 AS One
到INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
(两次).- 添加WHEREcontributor_ip IS NOT NULL".
现在它在 3 秒内运行!5% 的 ip 无法定位,可能是由于描述的差距(容易修复).
And now it runs in 3 seconds! 5% of the ips could not be geolocated, probably by the described gaps (easy fix).
现在,从 LIMIT 1000 到 LIMIT 300000 怎么样.需要多长时间?
Now, how about going from the LIMIT 1000 to LIMIT 300000. How long will it take?
37 秒!比描述的 25 分钟好多了.如果你想更上一层楼,我建议把右边的桌子变成静态的——因为一旦计算出来它根本不会改变,它只是基本规则的扩展.然后你可以使用 JOIN EACH.
37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest turning the right side table into a static one - as once computed it doesn't change at all, it's just an expansion of the basic rules. Then you can use JOIN EACH.
SELECT
id,
client_ip,
client_ip_code,
B.Country_Name AS Country_Name
FROM (
SELECT
id,
contributor_ip AS client_ip,
INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
FROM
[publicdata:samples.wikipedia]
WHERE contributor_ip IS NOT NULL
LIMIT
300000
) AS A1
JOIN
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name,
INTEGER(From_IP_Code/(256*256*256)) AS One
FROM
--3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the END of the set
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
FROM
[playscape-proj:GeoIP.GeoIP]) -- all Ranges ov valid IPs
,
(
SELECT
PriorRangeEndIP+1 From_IP_Code,
From_IP_Code-1 AS To_IP_Code,
'NA' AS Country_Name -- Missing rages lower FROM From_IP
from(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
,
LAG(To_IP_Code,
1,
INTEGER(0)) OVER(
ORDER BY
From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
FROM
[playscape-proj:GeoIP.GeoIP])A
WHERE
From_IP_Code>PriorRangeEndIP+1) -- If gap FROM prior valid range IS >1 than its a gap to fill
,
(
SELECT
MAX(To_IP_Code)+1 AS From_IP_Code,
INTEGER (4311810304) AS To_IP_Code,
'NA' AS Country_Name -- Missing rages higher tan Max To_IP
FROM
[playscape-proj:GeoIP.GeoIP])
) AS B
ON A1.ONE=B.ONE --fake JOIN condition to overcome allowed use of = only IN joins
WHERE
A1.client_ip_code>=B.From_IP_Code
AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code=1 ) -- WHERE there IS no valid IP ON left contributor_ip;
这篇关于如何提高 BigQuery 中 GeoIP 查询的性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!