Does R have any package for parsing out the parts of a URL?
Question
I have a list of URLs that I would like to parse and normalize.
I'd like to be able to split each address into parts so that I can identify "www.google.com/test/index.asp" and "google.com/somethingelse" as being from the same website.
Recommended answer
Let's see. A URL consists of a protocol, a "netloc" (which may include username, password, hostname and port components), and a remainder, which we happily strip away. Let's first assume there is no username, password, or port.
- ^(?:(?:[[:alpha:]+.-]+)://)? matches the protocol header (copied from parse_url()); we strip it away if we find it.
- A potential www. prefix is also stripped away, but not captured: (?:www\\.)?
- Anything up to the subsequent slash is our fully qualified host name, which we capture: ([^/]+)
- The rest we ignore: .*$
Now we plug the regexes above together, and the extraction of the hostname becomes:
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^/]+)"
REST_REGEX <- ".*$"
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)
To allow for a port, change the host name regex to include (but not capture) it, and rebuild URL_REGEX afterwards so that the new pattern takes effect:
HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"
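Put together, the port-aware version might look like the following. This is a minimal sketch that reuses the names from the answer; the two example URLs are made up for illustration.

```r
# Pieces from the answer, with the port-aware host name pattern.
PROTOCOL_REGEX <- "^(?:(?:[[:alpha:]+.-]+)://)?"
PREFIX_REGEX   <- "(?:www\\.)?"
HOSTNAME_REGEX <- "([^:/]+)(?::[0-9]+)?"   # host name captured, port matched but not captured
REST_REGEX     <- ".*$"

# Rebuild the full pattern; the old URL_REGEX still contains the old host name part.
URL_REGEX <- paste0(PROTOCOL_REGEX, PREFIX_REGEX, HOSTNAME_REGEX, REST_REGEX)

domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)

# Hypothetical inputs that carry a port.
with.port <- domain.name(c("http://localhost:8080/app",
                           "www.example.com:443/index.html"))
print(with.port)
```

The port is consumed by the non-capturing group, so only the host name ends up in the first capture group.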
And so forth, until we finally arrive at an RFC-compliant regular expression for parsing URLs. However, for home use, the above should suffice:
> domain.name(c("test.server.com/test", "www.google.com/test/index.asp",
"http://test.com/?ex"))
[1] "test.server.com" "google.com" "test.com"
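To get from host names back to the original goal of spotting addresses from the same website, one option is to group the URLs by the extracted name. The sketch below assumes base R only; lower-casing is an added normalization step (host names are case-insensitive), and the names urls, sites and by.site are illustrative.

```r
# Same pattern as in the answer (without the port extension).
URL_REGEX <- paste0(
  "^(?:(?:[[:alpha:]+.-]+)://)?",  # optional protocol
  "(?:www\\.)?",                   # optional www. prefix, not captured
  "([^/]+)",                       # host name, captured
  ".*$"                            # rest, ignored
)
domain.name <- function(urls) gsub(URL_REGEX, "\\1", urls)

urls <- c("www.google.com/test/index.asp",
          "google.com/somethingelse",
          "http://Test.com/?ex")

# Host names are case-insensitive, so lower-case them before comparing.
sites <- tolower(domain.name(urls))

# Group the original URLs by their normalized host name.
by.site <- split(urls, sites)
print(by.site)
```

Each element of by.site holds the URLs that share a host name, so the two google.com addresses from the question land in the same group.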