How to make this PHP URL parsing function nearly perfect?
Question
This function is great, but its main flaw is that it doesn't handle domains ending with .co.uk or .com.au. How can it be modified to handle this?
function parseUrl($url) {
    $r = "^(?:(?P<scheme>\w+)://)?";
    $r .= "(?:(?P<login>\w+):(?P<pass>\w+)@)?";
    $r .= "(?P<host>(?:(?P<subdomain>[-\w\.]+)\.)?" . "(?P<domain>[-\w]+\.(?P<extension>\w+)))";
    $r .= "(?::(?P<port>\d+))?";
    $r .= "(?P<path>[\w/-]*/(?P<file>[\w-]+(?:\.\w+)?)?)?";
    $r .= "(?:\?(?P<arg>[\w=&]+))?";
    $r .= "(?:#(?P<anchor>\w+))?";
    $r = "!$r!";
    preg_match($r, $url, $out);
    return $out;
}
To clarify, the reason I'm looking for something other than parse_url() is that I also want to strip out (possibly multiple) subdomains.
print_r(parse_url('http://sub1.sub2.test.co.uk'));
Result:
Array(
[scheme] => http
[host] => sub1.sub2.test.co.uk
)
What I want to extract is "test.co.uk" (sans subdomains), so first using parse_url is a pointless extra step where the output is the same as the input.
Recommended answer
This may or may not be of interest, but here's a regex I wrote that mostly conforms to RFC3986 (it's actually slightly stricter, as it disallows some of the more unusual URI syntaxes):
~^(?:(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)?(?P<authority>(?:(?P<userinfo>(?P<username>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?:(?P<password>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?|(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:)*?)@)?(?P<host>(?P<domain>(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?\.)+(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?))|(?P<ip>(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)))(?::(?P<port>\d+))?(?=/|$)))?(?P<path>/?(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/)*(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/?)?)(?:\?(?P<query>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*?))?(?:#(?P<fragment>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*))?$~i
The named components are:
scheme
authority
userinfo
username
password
host
domain
ip
port
path
query
fragment
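As a rough usage sketch, the pattern can be fed straight to preg_match() and the numeric keys discarded so that only the named components remain. The sample URL is made up for illustration, and the copy of the pattern used here escapes the dots between the IPv4 octets so that they match literally.

```php
<?php
// The pattern from above, with the dots between the IPv4 octets escaped.
// A nowdoc keeps every backslash, quote and dollar sign literal.
$pattern = <<<'REGEX'
~^(?:(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)?(?P<authority>(?:(?P<userinfo>(?P<username>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?:(?P<password>(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=])*)?|(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:)*?)@)?(?P<host>(?P<domain>(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?\.)+(?:[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?))|(?P<ip>(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)\.(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)))(?::(?P<port>\d+))?(?=/|$)))?(?P<path>/?(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/)*(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)+/?)?)(?:\?(?P<query>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*?))?(?:#(?P<fragment>(?:(?:[\w.\~-]|(?:%[\da-f]{2})|[!$&'()*+,;=]|:|@)|/|\?)*))?$~i
REGEX;

// Hypothetical sample URL, chosen only to exercise most of the groups.
$uri = 'http://user:pass@www.example.co.uk:8080/path/file.html?x=1#top';

$components = [];
if (preg_match($pattern, $uri, $matches)) {
    // preg_match() returns each named group twice (by name and by index);
    // keep only the string keys.
    $components = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
}

print_r($components);
```

Note that the domain group still holds the full host name, subdomains included; the pattern alone does not separate "example.co.uk" from "www".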
And here's the code that generates it (along with variants defined by some options):
public static function validateUri($uri, &$components = false, $flags = 0)
{
    // Flags may arrive as extra arguments or as an array; fold them into one bitmask.
    if (func_num_args() > 3)
    {
        $flags = array_slice(func_get_args(), 2);
    }
    if (is_array($flags))
    {
        $flagsArray = $flags;
        $flags = 0; // Must start as an integer so that |= works.
        foreach ($flagsArray as $flag)
        {
            if (is_int($flag))
            {
                $flags |= $flag;
            }
        }
    }

    // Set options.
    $requireScheme = !($flags & self::URI_ALLOW_NO_SCHEME);
    $requireAuthority = !($flags & self::URI_ALLOW_NO_AUTHORITY);
    $isRelative = (bool)($flags & self::URI_IS_RELATIVE);
    $requireMultiPartDomain = (bool)($flags & self::URI_REQUIRE_MULTI_PART_DOMAIN);

    // And we're away…
    // Some character types (taken from RFC 3986: http://tools.ietf.org/html/rfc3986).
    $hex = '[\da-f]'; // Hexadecimal digit.
    $pct = "(?:%$hex{2})"; // "Percent-encoded" value.
    $gen = '[\[\]:/?#@]'; // Generic delimiters.
    $sub = '[!$&\'()*+,;=]'; // Sub-delimiters.
    $reserved = "(?:$gen|$sub)"; // Reserved characters.
    $unreserved = '[\w.\~-]'; // Unreserved characters.
    $pChar = "(?:$unreserved|$pct|$sub|:|@)"; // Path characters.
    $qfChar = "(?:$pChar|/|\?)"; // Query/fragment characters.

    // Other entities.
    $octet = '(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)';
    $label = '[a-z](?:[0-9a-z-]*(?:[0-9a-z]))?';
    $scheme = '(?:(?P<scheme>[a-z][0-9a-z.+-]*?)://)';

    // Authority components.
    $userInfo = "(?:(?P<userinfo>(?P<username>(?:$unreserved|$pct|$sub)*)?:(?P<password>(?:$unreserved|$pct|$sub)*)?|(?:$unreserved|$pct|$sub|:)*?)@)?";
    // The dots are escaped so that they match a literal "." between octets.
    $ip = "(?P<ip>$octet\.$octet\.$octet\.$octet)";
    if ($requireMultiPartDomain)
    {
        $domain = "(?P<domain>(?:$label\.)+(?:$label))";
    }
    else
    {
        $domain = "(?P<domain>(?:$label\.)*(?:$label))";
    }
    $host = "(?P<host>$domain|$ip)";
    $port = '(?::(?P<port>\d+))?';

    // Primary hierarchical URI components.
    $authority = "(?P<authority>$userInfo$host$port(?=/|$))";
    $path = "(?P<path>/?(?:$pChar+/)*(?:$pChar+/?)?)";

    // Final bits.
    $query = "(?:\?(?P<query>$qfChar*?))?";
    $fragment = "(?:#(?P<fragment>$qfChar*))?";

    // Construct the final pattern.
    $pattern = '~^';

    // Only include scheme and authority if the path is not relative.
    if (!$isRelative)
    {
        if ($requireScheme)
        {
            // If the scheme is required, then the authority must also be there.
            $pattern .= $scheme . $authority;
        }
        else if ($requireAuthority)
        {
            $pattern .= "$scheme?$authority";
        }
        else
        {
            $pattern .= "(?:$scheme?$authority)?";
        }
    }
    else
    {
        // Disallow that optional slash we put in $path.
        $pattern .= '(?!/)';
    }

    // Now add standard elements and terminate the pattern.
    $pattern .= $path . $query . $fragment . '$~i';

    // Finally, validate that sucker!
    $components = array();
    $result = (bool)preg_match($pattern, $uri, $matches);
    if ($result)
    {
        // Filter out all of the useless numerical matches.
        foreach ($matches as $key => $value)
        {
            if (!is_int($key))
            {
                $components[$key] = $value;
            }
        }
        return true;
    }
    else
    {
        return false;
    }
}
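The validator above reports the whole host, so stripping subdomains (the original goal) still needs a separate step that knows which suffixes span more than one label. What follows is only a minimal sketch of that step: registrableDomain() is a hypothetical helper, not part of the code above, and its default suffix list is a two-entry sample that would have to be replaced with a complete, maintained list of public suffixes in practice.

```php
<?php
// Hypothetical helper, not part of the answer's code.
// $multiPartSuffixes is a tiny illustrative sample, not a complete list.
function registrableDomain(string $host, array $multiPartSuffixes = ['co.uk', 'com.au']): string
{
    $host = strtolower(rtrim($host, '.'));
    $labels = explode('.', $host);

    // Default assumption: the suffix is just the last label (e.g. "com").
    $suffixLabels = 1;
    foreach ($multiPartSuffixes as $suffix) {
        $needle = '.' . $suffix;
        if (substr($host, -strlen($needle)) === $needle) {
            // Prefer the longest known suffix that the host ends with.
            $suffixLabels = max($suffixLabels, substr_count($suffix, '.') + 1);
        }
    }

    // Keep the suffix plus one label to its left, if there is one.
    return implode('.', array_slice($labels, -($suffixLabels + 1)));
}

echo registrableDomain('sub1.sub2.test.co.uk'), "\n";
echo registrableDomain('www.example.com'), "\n";
```

The host passed in could come from the host or domain component produced by the validator, or from parse_url() when the input carries a scheme.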