Get domain name from given URL
Question
Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can contain http/https. Here is the Java code that I wrote. Though it seems to work fine, is there a better approach, or are there edge cases where it could fail?
public static String getDomainName(String url) throws MalformedURLException {
    if (!url.startsWith("http") && !url.startsWith("https")) {
        url = "http://" + url;
    }
    URL netUrl = new URL(url);
    String host = netUrl.getHost();
    if (host.startsWith("www")) {
        host = host.substring("www".length() + 1);
    }
    return host;
}
Output: google.com
Recommended answer
If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems: its equals method does a DNS lookup, which means code using it can be vulnerable to denial-of-service attacks when used with untrusted inputs.
"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.
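As a small illustration of the equality point, here is a minimal sketch; the class name UriEqualityDemo, the helper countDistinct, and the sample addresses are made up for this example. URI.equals compares the parsed components syntactically (ignoring case in the scheme and host) and does not touch the network, so URIs can be used as keys in hash-based collections.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

class UriEqualityDemo {
    // Counts distinct URIs using only URI.equals/hashCode (purely syntactic, no DNS).
    public static int countDistinct(String... urls) {
        Set<URI> seen = new HashSet<>();
        for (String u : urls) {
            // URI.create throws IllegalArgumentException on syntactically invalid input.
            seen.add(URI.create(u));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // The first two differ only in the case of scheme and host.
        System.out.println(countDistinct(
                "http://example.com/a",
                "HTTP://EXAMPLE.com/a",
                "http://example.com/b"));
    }
}
```

The path is still compared case-sensitively, so "/a" and "/A" count as different URIs.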
public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}
This should do what you want.
"Though it seems to work fine, is there any better approach, or are there some edge cases that could fail?"
Your code as written fails for these valid URLs:
httpfoo/bar -- a relative URL with a path component that starts with "http".
HTTP://example.com/ -- the protocol is case-insensitive.
//example.com/ -- a protocol-relative URL with a host.
www/foo -- a relative URL with a path component that starts with "www".
wwwexample.com -- a domain name that does not start with "www." but starts with "www".
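For relative references such as those in the list above, URI.getHost() returns null, so a caller has to decide what to do when there is no host. The following is a minimal sketch of one way to handle that, not the only reasonable design: the class name DomainNames, the Optional return type, lower-casing the host, and treating unparsable input as "no host" are all assumptions made for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;

class DomainNames {
    // Returns the host without a leading "www.", or empty when the input has
    // no host (e.g. a relative reference) or is not a syntactically valid URI.
    public static Optional<String> getDomainName(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return Optional.empty(); // assumption: unparsable input means "no host"
        }
        if (host == null) {
            return Optional.empty();
        }
        // Host names are case-insensitive, so normalise before comparing.
        host = host.toLowerCase(Locale.ROOT);
        return Optional.of(host.startsWith("www.") ? host.substring(4) : host);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.google.com/search", "HTTP://example.com/", "//example.com/",
            "httpfoo/bar", "www/foo", "wwwexample.com"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + getDomainName(s).orElse("(no host)"));
        }
    }
}
```

Returning an empty Optional makes the "no host" case explicit to callers instead of surfacing as a NullPointerException.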
Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.
If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.
The following line is the regular expression for breaking-down a well-formed URI reference into its components.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
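As a rough sketch of how that regular expression could be applied in Java: the class name Rfc3986Regex and the split helper are hypothetical, and only the pattern itself comes from the RFC. Groups 2, 4, 5, 7 and 9 hold the scheme, authority, path, query and fragment. Note that the authority may still contain user information and a port, which this sketch does not strip.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Rfc3986Regex {
    // Regular expression from RFC 3986, Appendix B (backslash doubled for a Java string).
    private static final Pattern URI_REFERENCE = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns {scheme, authority, path, query, fragment}; undefined components are null.
    public static String[] split(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        if (!m.matches()) {
            // Can happen for input containing line terminators, which '.' does not match.
            throw new IllegalArgumentException("not a single-line URI reference: " + reference);
        }
        return new String[] { m.group(2), m.group(4), m.group(5), m.group(7), m.group(9) };
    }

    public static void main(String[] args) {
        // Example reference taken from RFC 3986, Appendix B.
        String[] parts = split("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        String[] names = { "scheme", "authority", "path", "query", "fragment" };
        for (int i = 0; i < parts.length; i++) {
            System.out.println(names[i] + " = " + parts[i]);
        }
    }
}
```

Because the pattern accepts almost any string, it splits messy input into components but does not validate them; any checks on the resulting authority are left to the caller.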