Does R have any package for parsing out the parts of a URL?


Problem description

I have a list of urls that I would like to parse and normalize.

I'd like to be able to split each address into parts so that I can identify "www.google.com/test/index.asp" and "google.com/somethingelse" as being from the same website.

Recommended answer

Let's see. A URL consists of a protocol, a "netloc" (which may include username, password, hostname and port components), and a remainder that we happily strip away. Let's first assume there is no username, password, or port.

  • ^(?:(?:[[:alpha:]+.-]+)://)? matches the protocol header (copied from parse_url()); if it is present, we strip it away
  • A potential www. prefix is also stripped away, but not captured: (?:www\\.)?
  • Anything up to the next slash is our fully qualified host name, which we capture: ([^/]+)
  • The rest we ignore: .*$

Now we plug together the regexes above, and the extraction of the hostname becomes:

PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^/]+)"
REST_REGEX <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)

Change the host name regex to include (but not capture) the port, then rebuild URL_REGEX with the same paste0() call so the change takes effect:

HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
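Put together, the port-aware variant might look like the sketch below. The pieces are the ones from the answer; the only additions are `perl = TRUE` (an assumption of this sketch, chosen so the non-capturing groups are handled by the PCRE engine) and the example.com URLs, which are made up for illustration.

```r
# Regex pieces from the answer, with the port-aware host name variant.
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"   # optional protocol header
PREFIX_REGEX   <- "(?:www\\.)?"                    # optional www. prefix
HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"           # host name (captured), optional port
REST_REGEX     <- ".*$"                            # everything else is ignored

# URL_REGEX must be rebuilt: paste0() copied the old pieces into a new
# string, so reassigning HOSTNAME_REGEX alone does not change it.
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)

# perl = TRUE is an assumption of this sketch, not part of the original answer.
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls, perl = TRUE)

# Hypothetical inputs: the first carries a protocol, a www. prefix and a port.
hosts <- domain.name(c("http://www.example.com:8080/path", "example.com/other"))
print(hosts)
```

Both inputs should reduce to the same host name, since the port is matched but sits outside the capturing group.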

And so forth and so on, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:

> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
                "http://test.com/?ex"))
[1] "test.server.com" "google.com"      "test.com"       
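To connect this back to the question, the extracted host names can be used as grouping keys. The following is a minimal sketch in base R that reuses the regex pieces from the answer; the variable names `sites`, `same.site` and `by.site` are illustrative, and `perl = TRUE` is again an assumption of the sketch rather than part of the original answer.

```r
# The answer's regex, assembled in one step (port-aware host name variant).
URL_REGEX <- paste0(
  "^(?:(?:[[:alpha:]+.-]+)://)?",  # optional protocol header
  "(?:www\\.)?",                   # optional www. prefix, not captured
  "([^:/]+)(?::[0-9]+)?",          # host name (captured), optional port
  ".*$"                            # the rest is ignored
)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls, perl = TRUE)

urls <- c("www.google.com/test/index.asp",
          "google.com/somethingelse",
          "http://test.com/?ex")

sites <- domain.name(urls)

# The two addresses from the question reduce to the same key.
same.site <- sites[1] == sites[2]

# Group the original URLs by their extracted host name.
by.site <- split(urls, sites)
print(by.site)
```

Host names are case-insensitive, so wrapping the result in `tolower()` before grouping is probably worthwhile when the input mixes cases.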
