Return root domain from URL in R
Question
Given website addresses, e.g.
http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2
how do I return the root domain in R, e.g.
example.com
example2.co.uk
For my purposes I would define the root domain to have the structure
example_name.public_suffix
where example_name excludes "www" and public_suffix is on the list here:
https://publicsuffix.org/list/effective_tld_names.dat
Is this still the best regex-based solution?
https://stackoverflow.com/a/8498629/2109289
What about something in R that parses the root domain based on the public suffix list, something like:
http://simonecarletti.com/code/publicsuffix/
Using XML::parseURI seems to return the text between the first "//" and the next "/" (without the port), e.g.
> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"
Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:
- Match the domain against all rules and take note of the matching ones.
- If no rules match, the prevailing rule is "*".
- If more than one rule matches, the prevailing rule is the one which is an exception rule.
- If there is no matching exception rule, the prevailing rule is the one with the most labels.
- If the prevailing rule is an exception rule, modify it by removing the leftmost label.
- The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
- The registered or registrable domain is the public suffix plus one additional label.
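The steps above can be sketched in base R roughly as follows. This is only an illustration, not a complete implementation: the function name `registrable_domain` is made up for this example, and `rules` is a tiny hand-picked subset standing in for the full public suffix list (which would normally be read from the file linked above, with comments and blank lines removed).

```r
# Tiny illustrative rule set; an assumption standing in for the full list.
rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# Hypothetical helper: returns the registrable domain of a single host name,
# or NA if the host is itself a public suffix.
registrable_domain <- function(host, rules) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)

  # TRUE if a rule (exception marker already stripped) matches the
  # rightmost labels of the host; "*" matches any single label.
  rule_matches <- function(rule) {
    r <- strsplit(rule, ".", fixed = TRUE)[[1]]
    k <- length(r)
    if (k > n) return(FALSE)
    tail_labels <- labels[(n - k + 1):n]
    all(r == "*" | r == tail_labels)
  }

  is_exception <- startsWith(rules, "!")
  bare <- sub("^!", "", rules)
  matched <- vapply(bare, rule_matches, logical(1))

  if (any(matched & is_exception)) {
    # An exception rule prevails; drop its leftmost label.
    r <- strsplit(bare[matched & is_exception][1], ".", fixed = TRUE)[[1]]
    suffix_len <- length(r) - 1
  } else if (any(matched)) {
    # Otherwise the matching rule with the most labels prevails.
    suffix_len <- max(lengths(strsplit(bare[matched], ".", fixed = TRUE)))
  } else {
    # No rule matched: the prevailing rule is "*".
    suffix_len <- 1
  }

  # Registrable domain = public suffix plus one additional label.
  if (n <= suffix_len) return(NA_character_)
  paste(labels[(n - suffix_len):n], collapse = ".")
}

registrable_domain("subdomain.example2.co.uk", rules)
registrable_domain("www.example.com", rules)
```

With this rule set the two calls should give "example2.co.uk" and "example.com". The sketch handles one host at a time and does no normalisation beyond lower-casing, so internationalised names and trailing dots would need extra care.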
Recommended answer
There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:
library(httr)

host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
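If adding a dependency is not desirable, the host name can also be pulled out with a regular expression in base R. The following is a rough sketch rather than a full URL parser: `extract_host` is a name invented for this example, it assumes the URL starts with a scheme such as "http://", and it does not handle user information ("user@host") or bracketed IPv6 addresses.

```r
# Base R sketch: capture the text between "scheme://" and the next
# "/", ":", "?" or "#". Inputs that do not match are returned unchanged.
extract_host <- function(urls) {
  sub("^[A-Za-z][A-Za-z0-9+.-]*://([^/:?#]+).*$", "\\1", urls)
}

extract_host(c(
  "http://www.example.com/page1/#",
  "https://subdomain.example2.co.uk/asdf?retrieve=2",
  "http://www.blog.omegahat.org:8080/RCurl/index.html"
))
```

Because `sub` is vectorised, this works on a whole character vector of URLs at once, whereas parse_url is called on one URL at a time.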
The second is extracting the organizational domain (or root domain, or top private domain, whatever you want to call it). This can be done using the tldextract package (which is inspired by the Python package of the same name and uses Mozilla's public suffix list):
library(tldextract)

domain.info <- tldextract(host)
domain.info
#                       host subdomain   domain   tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk
tldextract returns a data frame with one row for each domain you give it, but you can easily paste together the relevant parts:
paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"