Get domain name from given URL
Question
Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can contain http/https. Here is the Java code that I wrote. Though it seems to work fine, is there any better approach, or are there some edge cases that could fail?
import java.net.MalformedURLException;
import java.net.URL;

public static String getDomainName(String url) throws MalformedURLException {
    if (!url.startsWith("http") && !url.startsWith("https")) {
        url = "http://" + url;
    }
    URL netUrl = new URL(url);
    String host = netUrl.getHost();
    if (host.startsWith("www")) {
        host = host.substring("www".length() + 1);
    }
    return host;
}
Output: google.com
Recommended answer
If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems: its equals method does a DNS lookup, which means code using it can be vulnerable to denial-of-service attacks when used with untrusted inputs.
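As a small illustration of the alternative, the sketch below compares two strings with java.net.URI, whose equals method works on the parsed components and performs no name resolution. The class name UriEquality and the helper sameUri are hypothetical names chosen for this example.

```java
import java.net.URI;

class UriEquality {
    /** Compares two URI strings component by component, without any network access. */
    static boolean sameUri(String a, String b) {
        // URI.create throws IllegalArgumentException if either string is not a valid URI
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        System.out.println(sameUri("http://example.com/a", "http://example.com/a"));
        System.out.println(sameUri("http://example.com/a", "http://example.com/b"));
        // scheme and host are compared without regard to case
        System.out.println(sameUri("HTTP://EXAMPLE.com/", "http://example.com/"));
    }
}
```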
"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.
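import java.net.URI;
import java.net.URISyntaxException;

public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    if (domain == null) return null; // no host, e.g. a relative reference
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}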
should do what you want.
"Though it seems to work fine, is there any better approach, or are there some edge cases that could fail?"
Your code as written fails for these valid URLs:
- httpfoo/bar: a relative URL with a path component that starts with http.
- HTTP://example.com/: the protocol is case-insensitive.
- //example.com/: a protocol-relative URL with a host.
- www/foo: a relative URL with a path component that starts with www.
- wwwexample.com: a domain name that does not start with www. but does start with www.
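To see how java.net.URI treats inputs like these, here is a minimal sketch. The class name UriEdgeCases and the helper domainOf are hypothetical, and URI.create is used in place of the constructor only so that the helper throws an unchecked exception.

```java
import java.net.URI;

class UriEdgeCases {
    /** Returns the host without a leading "www.", or null if the URI has no host. */
    static String domainOf(String url) {
        // URI.create throws IllegalArgumentException on syntactically invalid input
        String host = URI.create(url).getHost();
        if (host == null) {
            return null; // relative reference: there is no host to extract
        }
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "httpfoo/bar", "HTTP://example.com/", "//example.com/",
            "www/foo", "http://wwwexample.com/"
        };
        for (String input : inputs) {
            URI uri = URI.create(input);
            System.out.println(input + " -> scheme=" + uri.getScheme()
                    + ", host=" + uri.getHost() + ", path=" + uri.getPath());
        }
    }
}
```

The relative references (httpfoo/bar and www/foo) parse as paths with no host, so the helper returns null for them instead of inventing a domain. A bare wwwexample.com would also parse as a relative path, which is why the example gives it an explicit scheme.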
Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.
If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.
The following line is the regular expression for breaking-down a well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
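A rough sketch of using that expression from Java follows. Only the pattern itself comes from RFC 3986; the class name Rfc3986Regex and the helper authorityOf are hypothetical names for this example. Group 4 in the numbering above is the authority component.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Rfc3986Regex {
    // Expression from RFC 3986 Appendix B, with the backslash escaped for a Java string literal.
    private static final Pattern URI_REFERENCE = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /** Returns the authority component (group 4), or null if the reference has none. */
    static String authorityOf(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        // group(4) is null when the optional "//authority" part did not participate in the match
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        System.out.println(authorityOf("http://www.ics.uci.edu/pub/ietf/uri/#Related"));
        System.out.println(authorityOf("//example.com/"));
        System.out.println(authorityOf("www/foo"));
    }
}
```

Note that the authority is not the same thing as the host: it may also carry user information and a port (for example user@example.com:8080), which would still need to be stripped to obtain a domain name.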