Return root domain from URL in R


Question

Given website addresses, e.g.

http://www.example.com/page1/#
https://subdomain.example2.co.uk/asdf?retrieve=2

how do I return the root domain in R, e.g.

example.com
example2.co.uk

For my purposes, I define the root domain as having the structure

example_name.public_suffix

where example_name excludes "www" and public_suffix is an entry on the list here:

https://publicsuffix.org/list/effective_tld_names.dat

Is this still the best regex-based solution:

https://stackoverflow.com/a/8498629/2109289

Is there something in R that parses the root domain based on the public suffix list, something like:

http://simonecarletti.com/code/publicsuffix/

Using XML::parseURI seems to return the text between the first "//" and the next "/", e.g.

> parseURI("http://www.blog.omegahat.org:8080/RCurl/index.html")$server
[1] "www.blog.omegahat.org"

Thus, the question reduces to having an R function that can return the public suffix from the URI, or implementing the following algorithm on the public suffix list:

  • Match the domain against all rules and take note of the matching ones.
  • If no rules match, the prevailing rule is "*".
  • If more than one rule matches, the prevailing rule is the one which is an exception rule.
  • If there is no matching exception rule, the prevailing rule is the one with the most labels.
  • If the prevailing rule is an exception rule, modify it by removing the leftmost label.
  • The public suffix is the set of labels from the domain which directly match the labels of the prevailing rule (joined by dots).
  • The registered or registrable domain is the public suffix plus one additional label.
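
The steps above can be sketched in base R. This is only an illustration under stated assumptions: the function name registrable_domain and the five-entry rules vector are hypothetical, the host is assumed to be a plain ASCII host name, and a real implementation would read the full list from publicsuffix.org rather than hard-code a toy subset.

```r
# Toy rule set for illustration only (assumption); the real list is far longer.
rules <- c("com", "uk", "co.uk", "*.ck", "!www.ck")

# Hypothetical helper: returns the registrable domain of a single host name.
registrable_domain <- function(host, rules) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)

  # A rule matches when its labels equal the rightmost labels of the host;
  # "*" stands for exactly one label, a leading "!" marks an exception rule.
  rule_matches <- function(rule) {
    r <- strsplit(sub("^!", "", rule), ".", fixed = TRUE)[[1]]
    k <- length(r)
    if (k > n) return(FALSE)
    tail_labels <- labels[(n - k + 1):n]
    all(r == "*" | r == tail_labels)
  }

  matched <- rules[vapply(rules, rule_matches, logical(1))]
  is_exception <- startsWith(matched, "!")

  if (length(matched) == 0) {
    # No rule matches: the prevailing rule is "*", i.e. one label.
    suffix_len <- 1
  } else if (any(is_exception)) {
    # An exception rule prevails; drop its leftmost label.
    ex <- matched[is_exception][1]
    suffix_len <- length(strsplit(ex, ".", fixed = TRUE)[[1]]) - 1
  } else {
    # Otherwise the rule with the most labels prevails.
    suffix_len <- max(lengths(strsplit(matched, ".", fixed = TRUE)))
  }

  # A host that is itself a public suffix has no registrable domain.
  if (n <= suffix_len) return(NA_character_)

  # Public suffix plus one additional label.
  paste(labels[(n - suffix_len):n], collapse = ".")
}

registrable_domain("www.example.com", rules)
registrable_domain("subdomain.example2.co.uk", rules)
```

With this toy rule set the two calls give "example.com" and "example2.co.uk". The sketch handles one host at a time; wrapping it in vapply would be one way to process a vector of hosts.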

Recommended answer

There are two tasks here. The first is parsing the URL to get the host name, which can be done with the httr package's parse_url function:

library(httr)

host <- parse_url("https://subdomain.example2.co.uk/asdf?retrieve=2")$hostname
host
# [1] "subdomain.example2.co.uk"
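
If an extra package dependency is unwanted for this first step, the host name can also be pulled out with base R regular expressions. The following is a minimal sketch, not a full URL parser: extract_host is a hypothetical helper name, and it assumes well-formed URLs without IPv6 literal hosts.

```r
# Hypothetical helper (assumption): extracts the host from each URL in a
# character vector using only base R.
extract_host <- function(url) {
  # Drop the scheme, e.g. "http://" or "https://".
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
  # Drop any userinfo part such as "user:pass@".
  host <- sub("^[^/?#]*@", "", host)
  # Drop everything from the first ":", "/", "?" or "#" onward
  # (port, path, query, fragment).
  host <- sub("[:/?#].*$", "", host)
  tolower(host)
}

extract_host(c(
  "http://www.example.com/page1/#",
  "https://subdomain.example2.co.uk/asdf?retrieve=2",
  "http://www.blog.omegahat.org:8080/RCurl/index.html"
))
```

Because sub is vectorised, the helper works on a whole vector of URLs in one call, which a single-URL parser would need a loop for.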

The second is extracting the organizational domain (also called the root domain or top private domain). This can be done using the tldextract package, which is inspired by the Python package of the same name and uses Mozilla's public suffix list:

library(tldextract)

domain.info <- tldextract(host)
domain.info
#                       host subdomain   domain   tld
# 1 subdomain.example2.co.uk subdomain example2 co.uk

tldextract returns a data frame with one row for each domain you give it, so you can easily paste together the relevant parts:

paste(domain.info$domain, domain.info$tld, sep=".")
# [1] "example2.co.uk"
