使用SQL从网址中剥离域名 [英] Stripping the domain name from a url using SQL
问题描述
我正在使用Big Query的Hacker News数据集,并正在研究哪些URL具有最多的新闻故事.我还想删除域名,然后看看其中哪些新闻报道最多.我正在R中工作,但在使以下查询正常工作时遇到了一些麻烦.
I'm working with Big Query's Hacker News dataset, and was looking at which urls have the most news stories. I'd also like to strip the domain names out, and see which of those have the most news stories. I'm working in R, and am having a bit of trouble getting the follow query to work.
# Select the ten domains that have the most stories
sql_domain <- "SELECT url REPLACE(CASE WHEN REGEXP_CONTAINS(url, '//')
THEN url ELSE CONCAT('http://', url) END, '&', '?') as domain_name,
COUNT(domain_name) as story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY domain_name
ORDER BY story_number DESC
LIMIT 10"
我不需要剥离顶级域名;例如,不需要stackoverflow
,stackoverflow.com
就可以.非常感谢您的帮助!
I don't need to strip the top-level domain; for example, stackoverflow
isn't required, stackoverflow.com
is fine. Your help is greatly appreciated!
推荐答案
问题出在查询中-您应按以下方式使用(适用于BigQuery Standard SQL)
The problem is in your query - you should use as below (for BigQuery Standard SQL)
SELECT
NET.REG_DOMAIN(url) AS domain_name,
COUNT(NET.REG_DOMAIN(url)) AS story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY 1
ORDER BY story_number DESC
LIMIT 10
这将为您提供类似
Row domain_name story_number
1 github.com 81784
2 medium.com 71953
3 youtube.com 58119
4 blogspot.com 52925
5 nytimes.com 48986
6 techcrunch.com 43924
7 google.com 26326
8 wordpress.com 23372
9 arstechnica.com 23162
10 wired.com 18480
这篇关于使用SQL从网址中剥离域名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!