Get domain name from given URL

Question

Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can contain http/https. Here is the Java code that I wrote. Though it seems to work fine, is there a better approach, or are there edge cases where it could fail?

public static String getDomainName(String url) throws MalformedURLException{
    if(!url.startsWith("http") && !url.startsWith("https")){
         url = "http://" + url;
    }        
    URL netUrl = new URL(url);
    String host = netUrl.getHost();
    if(host.startsWith("www")){
        host = host.substring("www".length()+1);
    }
    return host;
}

Input: http://google.com/blah

Output: google.com

Recommended answer

If you want to parse a URL, use java.net.URI. java.net.URL has a bunch of problems -- its equals method does a DNS lookup which means code using it can be vulnerable to denial of service attacks when used with untrusted inputs.
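
As a rough illustration of the difference, java.net.URI compares two URIs purely by their parsed components, so no network access is involved. The sketch below is only an assumption-labelled example: the class name UriEqualityCheck and the helper sameUri are made up for illustration and are not part of the original answer.

```java
import java.net.URI;

final class UriEqualityCheck {
    /**
     * Compares two URI strings with java.net.URI.equals, which works on the
     * parsed components only (scheme and host are compared case-insensitively)
     * and never resolves host names.
     * Hypothetical helper, named for this sketch only.
     */
    static boolean sameUri(String a, String b) {
        // URI.create throws an unchecked IllegalArgumentException on bad syntax.
        return URI.create(a).equals(URI.create(b));
    }

    public static void main(String[] args) {
        System.out.println(sameUri("http://example.com/a", "HTTP://EXAMPLE.COM/a"));
        System.out.println(sameUri("http://example.com/a", "http://example.com/b"));
    }
}
```

The first comparison differs only in the case of the scheme and host, which URI treats as equal; the second differs in the path, which is compared exactly.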

"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.

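import java.net.URI;
import java.net.URISyntaxException;

public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();
    // getHost() returns null when the reference has no host (e.g. a relative URL).
    if (domain == null) return null;
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}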

should do what you want.

Though it seems to work fine, is there a better approach, or are there edge cases where it could fail?

Your code as written fails for these valid URLs:

  • httpfoo/bar -- relative URL with a path component that starts with http.
  • HTTP://example.com/ -- protocol is case-insensitive.
  • //example.com/ -- protocol relative URL with a host
  • www/foo -- a relative URL with a path component that starts with www
  • wwwexample.com -- domain name that starts with www but not with www.
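
To see how java.net.URI treats the inputs listed above, here is a minimal sketch. The class name UriHostCheck and the helper hostOf are illustrative names, not from the original answer, and URI.create is used instead of the URI constructor only so that the helper throws an unchecked exception on malformed input.

```java
import java.net.URI;

final class UriHostCheck {
    /**
     * Returns the host that java.net.URI parses from the input, or null when
     * the reference has no authority component (for example a relative path).
     * Hypothetical helper, named for this sketch only.
     */
    static String hostOf(String input) {
        return URI.create(input).getHost();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "httpfoo/bar", "HTTP://example.com/", "//example.com/", "www/foo", "wwwexample.com"
        };
        for (String input : inputs) {
            System.out.println(input + " -> " + hostOf(input));
        }
    }
}
```

The three inputs without a "//" authority part are parsed as relative paths, so their host is null, while HTTP://example.com/ and //example.com/ both yield example.com. A caller that strips a leading "www." therefore needs to handle a null host first.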

Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.

If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:

Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
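
As a sketch of how that expression might be applied in Java, the example below compiles it with java.util.regex and reads groups 2, 4, 5, 7 and 9, which RFC 3986 Appendix B identifies as scheme, authority, path, query and fragment. The class name UriReferenceSplitter and the method split are hypothetical names chosen for this illustration; note that the regex performs no validation, so it accepts input that java.net.URI would reject.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UriReferenceSplitter {
    // RFC 3986 Appendix B expression; the '?' before the query must be escaped.
    private static final Pattern URI_REFERENCE = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /**
     * Splits a URI reference into {scheme, authority, path, query, fragment}.
     * Components that are absent come back as null; the path may be empty.
     * Hypothetical helper, named for this sketch only.
     */
    static String[] split(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        if (!m.matches()) {
            // Can happen, for instance, when the fragment contains a line break.
            throw new IllegalArgumentException("Not a single-line URI reference: " + reference);
        }
        return new String[] {m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)};
    }

    public static void main(String[] args) {
        String[] parts = split("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        System.out.println("scheme=" + parts[0] + " authority=" + parts[1]
            + " path=" + parts[2] + " query=" + parts[3] + " fragment=" + parts[4]);
    }
}
```

The authority returned here can still include user information and a port, so extracting a bare host name from it needs further work.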
