Does R have any package for parsing out the parts of a URL?

Views: 80 | Published: 2020/5/25 0:30:06 | Tags: r, parsing, url

Problem description

I have a list of urls that I would like to parse and normalize.

I'd like to be able to split each address into parts so that I can identify "www.google.com/test/index.asp" and "google.com/somethingelse" as being from the same website.

Recommended answer

Let's see. A URL consists of a protocol, a "netloc" (which may include username, password, hostname, and port components), and a remainder, which we happily strip away. Let's first assume there is no username, password, or port.

^(?:(?:[[:alpha:]+.-]+)://)? matches the protocol header (copied from parse_url()); we strip it away if we find it.
A potential www. prefix is also stripped away, but not captured: (?:www\\.)?
Anything up to the next slash is our fully qualified host name, which we capture: ([^/]+)
The rest we ignore: .*$

Now we plug together the regexes above, and the extraction of the hostname becomes:

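PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"  # optional scheme such as http://
PREFIX_REGEX <- "(?:www\\.)?"                     # optional www. prefix, not captured
HOSTNAME_REGEX <- "([^/]+)"                       # capture group 1: the host name
REST_REGEX <- ".*$"                               # everything after the host, discarded
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
# Replace each whole URL with capture group 1, i.e. the host name
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)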

To handle ports, change the host name regex to match (but not capture) the port, then rebuild URL_REGEX with paste0() as before:

HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
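
Put together, the port-aware variant might look like the sketch below. It reuses the building blocks from the answer; the example URLs are made up for illustration, and `sub()` with `perl = TRUE` is a choice made here (the pattern is anchored at both ends, so a single substitution is enough), not something the answer prescribes.

```r
# Same building blocks as above; only HOSTNAME_REGEX changes.
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX   <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"  # host is captured; an optional :port is matched but not captured
REST_REGEX     <- ".*$"

# URL_REGEX must be pasted together again, otherwise it still holds the old host pattern.
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)

# perl = TRUE is an addition in this sketch so that PCRE handles the (?:...) groups.
domain.name <- function(urls) sub(URL_REGEX, "\\1", urls, perl = TRUE)

# Illustrative inputs (assumed, not from the original post)
domain.name(c("http://localhost:8080/app", "www.example.org:443", "test.server.com/test"))
# [1] "localhost"       "example.org"     "test.server.com"
```

Note that a URL carrying a username or password would still be mishandled at this stage, since the text before the first colon would be taken as the host.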

And so on, until we eventually arrive at an RFC-compliant regular expression for parsing URLs. For home use, however, the above should suffice:

> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
                "http://test.com/?ex"))
[1] "test.server.com" "google.com"      "test.com"
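
As one possible next step in the "and so on" direction, the sketch below also skips an optional `user:password@` part and then groups the two URLs from the question by site. `USERINFO_REGEX`, `URL_REGEX_FULL`, and `site.name` are hypothetical names introduced here, the ftp URL is invented for illustration, and lower-casing the result is an added assumption (host names are case-insensitive). It is still far from RFC-compliant.

```r
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
USERINFO_REGEX <- "(?:[^/?#@]*@)?"        # hypothetical addition: optional user[:password]@
PREFIX_REGEX   <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"  # captured host, optional uncaptured port
REST_REGEX     <- ".*$"
URL_REGEX_FULL <- paste0(PROTOCOL_REGEX, USERINFO_REGEX, PREFIX_REGEX,
                         HOSTNAME_REGEX, REST_REGEX)

# Extract the host and lower-case it so that differently cased hosts compare equal.
site.name <- function(urls) tolower(sub(URL_REGEX_FULL, "\\1", urls, perl = TRUE))

urls <- c("www.google.com/test/index.asp",
          "google.com/somethingelse",
          "ftp://user:secret@files.example.org:21/pub")  # invented example

site.name(urls)
# [1] "google.com"        "google.com"        "files.example.org"

# Group the original URLs by the site they belong to.
by.site <- split(urls, site.name(urls))
by.site[["google.com"]]
# [1] "www.google.com/test/index.asp" "google.com/somethingelse"
```

Grouping with `split()` keeps the original URLs intact while using the normalized host only as the key, which matches the goal of identifying both google.com addresses as the same website.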
