Get domain name from given URL


Question

Given a URL, I want to extract the domain name (it should not include the 'www' part). The URL can use http or https. Here is the Java code that I wrote. It seems to work fine, but is there a better approach, or are there edge cases where it could fail?

public static String getDomainName(String url) throws MalformedURLException{
    if(!url.startsWith("http") && !url.startsWith("https")){
         url = "http://" + url;
    }        
    URL netUrl = new URL(url);
    String host = netUrl.getHost();
    if(host.startsWith("www")){
        host = host.substring("www".length()+1);
    }
    return host;
}

Input: http://google.com/blah

Output: google.com

Recommended answer

If you want to parse a URL, use java.net.URI. java.net.URL has a number of problems: for example, its equals method does a DNS lookup, so code that uses it can be vulnerable to denial-of-service attacks when it handles untrusted input.

"Mr. Gosling -- why did you make url equals suck?" explains one such problem. Just get in the habit of using java.net.URI instead.

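import java.net.URI;
import java.net.URISyntaxException;

// Returns the host without a leading "www.", or null if the URI has no host.
public static String getDomainName(String url) throws URISyntaxException {
    URI uri = new URI(url);
    String domain = uri.getHost();  // null for relative URIs such as "www/foo"
    if (domain == null) return null;
    return domain.startsWith("www.") ? domain.substring(4) : domain;
}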

This should do what you want.
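As a rough illustration of how the URI-based approach behaves on a few inputs, here is a minimal, self-contained sketch. The class name DomainNameExample and the method findDomainName are hypothetical and not part of the original answer. The Optional return type is an assumption made here so that unparseable or host-less input yields an empty result instead of an exception.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

public class DomainNameExample {
    // Hypothetical variant of getDomainName: returns an empty Optional when the
    // input cannot be parsed as a URI or when the parsed URI has no host.
    public static Optional<String> findDomainName(String url) {
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return Optional.empty();
            }
            // Strip only a leading "www." label.
            return Optional.of(host.startsWith("www.") ? host.substring(4) : host);
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(findDomainName("http://google.com/blah"));   // Optional[google.com]
        System.out.println(findDomainName("https://www.example.com/")); // Optional[example.com]
        System.out.println(findDomainName("google.com/blah"));          // Optional.empty (relative reference, no host)
        System.out.println(findDomainName("http://exa mple.com/"));     // Optional.empty (space is not legal in a URI)
    }
}
```

Note that an input without a scheme, such as google.com/blah, parses as a relative path, so no host is reported. Whether to prepend a scheme in that case is a policy decision for the caller.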


"It seems to work fine, but is there a better approach, or are there edge cases where it could fail?"

Your code as written fails for these valid URLs:


  • httpfoo/bar: a relative URL whose path component starts with http.
  • HTTP://example.com/: the protocol is case-insensitive.
  • //example.com/: a protocol-relative URL with a host.
  • www/foo: a relative URL whose path component starts with www.
  • wwwexample.com: a domain name that starts with "www" but not with "www.".
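To see how java.net.URI treats inputs like the ones listed above, the following sketch prints the host it reports for each. The class EdgeCaseCheck and its helper methods are hypothetical names introduced for illustration. URI.create is used so that a syntax error surfaces as an unchecked IllegalArgumentException.

```java
import java.net.URI;

public class EdgeCaseCheck {
    // Hypothetical helper: the host as java.net.URI parses it,
    // or null when the reference has no host component.
    public static String hostOf(String reference) {
        return URI.create(reference).getHost();
    }

    // Strips only a leading "www." label, not every host that begins with the letters "www".
    public static String stripWww(String host) {
        return host.startsWith("www.") ? host.substring(4) : host;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "httpfoo/bar",            // relative path, no host
            "HTTP://example.com/",    // upper-case scheme
            "//example.com/",         // protocol-relative, has a host
            "www/foo",                // relative path, no host
            "http://wwwexample.com/"  // host starts with "www" but not "www."
        };
        for (String input : inputs) {
            String host = hostOf(input);
            System.out.println(input + " -> " + (host == null ? "(no host)" : stripWww(host)));
        }
    }
}
```

The relative references report no host rather than being misread as domain names, and wwwexample.com is left intact because only the exact prefix "www." is removed.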

Hierarchical URLs have a complex grammar. If you try to roll your own parser without carefully reading RFC 3986, you will probably get it wrong. Just use the one that's built into the core libraries.

If you really need to deal with messy inputs that java.net.URI rejects, see RFC 3986 Appendix B:


Appendix B. Parsing a URI Reference with a Regular Expression

As the "first-match-wins" algorithm is identical to the "greedy" disambiguation method used by POSIX regular expressions, it is natural and commonplace to use a regular expression for parsing the potential five components of a URI reference.

The following line is the regular expression for breaking-down a well-formed URI reference into its components.

  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9

The numbers in the second line above are only to assist readability; they indicate the reference points for each subexpression (i.e., each paired parenthesis).
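As a sketch of how the Appendix B expression could be applied in Java, the class below compiles it with java.util.regex and returns group 4, the authority. The class name AppendixBParser and the method authorityOf are hypothetical. The authority may still contain user info or a port, which this expression does not split out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AppendixBParser {
    // The RFC 3986 Appendix B expression, with the backslash escaped for a Java string literal.
    // Group 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
    private static final Pattern URI_REFERENCE = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Hypothetical helper: returns the authority component,
    // or null when the reference has none.
    public static String authorityOf(String reference) {
        Matcher m = URI_REFERENCE.matcher(reference);
        if (!m.matches()) {
            return null; // e.g. input containing line terminators
        }
        return m.group(4); // null when the optional "//authority" part is absent
    }

    public static void main(String[] args) {
        // Example URI taken from RFC 3986 Appendix B.
        System.out.println(authorityOf("http://www.ics.uci.edu/pub/ietf/uri/#Related")); // www.ics.uci.edu
        System.out.println(authorityOf("www/foo"));                                      // null
        // A messy input that java.net.URI rejects because of the space.
        System.out.println(authorityOf("http://exa mple.com/path"));                     // exa mple.com
    }
}
```

Because the expression accepts almost any string, it does no validation; the caller still has to decide what to do with an authority such as "exa mple.com".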
