使用SQL从网址中剥离域名 [英] Stripping the domain name from a url using SQL

查看:63
本文介绍了使用SQL从网址中剥离域名的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用Big Query的Hacker News数据集,并正在研究哪些URL具有最多的新闻故事.我还想删除域名,然后看看其中哪些新闻报道最多.我正在R中工作,但在使以下查询正常工作时遇到了一些麻烦.

I'm working with Big Query's Hacker News dataset, and was looking at which urls have the most news stories. I'd also like to strip the domain names out, and see which of those have the most news stories. I'm working in R, and am having a bit of trouble getting the follow query to work.

# Select the ten domains that have the most stories
sql_domain <- "SELECT url REPLACE(CASE WHEN REGEXP_CONTAINS(url, '//') 
                        THEN url ELSE CONCAT('http://', url) END, '&', '?') as domain_name,
                      COUNT(domain_name) as story_number
                FROM `bigquery-public-data.hacker_news.full`
                WHERE type = 'story'
                GROUP BY domain_name
                ORDER BY story_number DESC
                LIMIT 10"

我不需要剥离顶级域名;例如,不需要stackoverflowstackoverflow.com就可以.非常感谢您的帮助!

I don't need to strip the top-level domain; for example, stackoverflow isn't required, stackoverflow.com is fine. Your help is greatly appreciated!

推荐答案

问题出在查询中-您应按以下方式使用(适用于BigQuery Standard SQL)

The problem is in your query - you should use as below (for BigQuery Standard SQL)

SELECT 
  NET.REG_DOMAIN(url) AS domain_name,
  COUNT(NET.REG_DOMAIN(url)) AS story_number
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
GROUP BY 1
ORDER BY story_number DESC
LIMIT 10   

这将为您提供类似

Row domain_name     story_number     
1   github.com      81784    
2   medium.com      71953    
3   youtube.com     58119    
4   blogspot.com    52925    
5   nytimes.com     48986    
6   techcrunch.com  43924    
7   google.com      26326    
8   wordpress.com   23372    
9   arstechnica.com 23162    
10  wired.com       18480    

这篇关于使用SQL从网址中剥离域名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆