How to validate internationalized domain names

Question

I want to validate a domain URL in PHP that may be in internationalized domain name format, such as the Greek domain name http://παράδειγμα.δοκιμή. Is there any way to validate it using a regular expression?

Recommended answer

If you want to create your own library, you need to use the tables of permitted code points (IANA: Repository of IDN Practices, IDN Character Validation Guidance, IDNA Parameters) and the table of Unicode Script properties (UNIDATA/Scripts.txt).

Gmail adopts the Unicode Consortium's "Highly Restricted" specification (Protecting Gmail in a global world). The following combinations of Unicode scripts are permitted.

  • Single script
  • Latin + Han + Hiragana + Katakana
  • Latin + Han + Bopomofo
  • Latin + Han + Hangul
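
The combinations listed above can be checked with PCRE script classes. The following is a minimal sketch, not Gmail's implementation: the names KNOWN_SCRIPTS, ALLOWED_COMBINATIONS, scripts_in_label and is_highly_restricted are hypothetical, the script list is deliberately short, and any character outside it is rejected. Results for characters with script extensions may differ between PCRE versions.

```php
<?php
// Scripts this sketch knows about (an assumption; a real check would cover
// every script in UNIDATA/Scripts.txt).
const KNOWN_SCRIPTS = ['Latin', 'Greek', 'Cyrillic', 'Han', 'Hiragana', 'Katakana', 'Bopomofo', 'Hangul'];

// The multi-script combinations listed above.
const ALLOWED_COMBINATIONS = [
    ['Latin', 'Han', 'Hiragana', 'Katakana'],
    ['Latin', 'Han', 'Bopomofo'],
    ['Latin', 'Han', 'Hangul'],
];

// Returns the known scripts used in a label, or null if the label is empty,
// is not valid UTF-8, or contains a character from a script not listed above.
function scripts_in_label(string $label): ?array
{
    if ($label === '' || preg_match('//u', $label) !== 1) {
        return null;
    }

    // Common and Inherited characters do not count towards the script set.
    $rest = preg_replace('/[\p{Common}\p{Inherited}]+/u', '', $label);
    $scripts = [];

    foreach (KNOWN_SCRIPTS as $script) {
        $stripped = preg_replace('/\p{' . $script . '}+/u', '', $rest);
        if ($stripped !== $rest) {
            $scripts[] = $script;
            $rest = $stripped;
        }
    }

    return $rest === '' ? $scripts : null;
}

function is_highly_restricted(string $label): bool
{
    $scripts = scripts_in_label($label);
    if ($scripts === null) {
        return false;
    }
    if (count($scripts) <= 1) {
        return true; // single script
    }
    foreach (ALLOWED_COMBINATIONS as $combination) {
        if (array_diff($scripts, $combination) === []) {
            return true; // every script used belongs to one allowed combination
        }
    }
    return false;
}

var_dump(is_highly_restricted('παράδειγμα'));       // Greek only
var_dump(is_highly_restricted("ex\u{0430}mple"));   // Latin mixed with Cyrillic U+0430
```

Rejecting unknown scripts is a conservative choice for this sketch; extend KNOWN_SCRIPTS if you need to accept more single-script labels.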

You may need to pay attention to the special script property values (Common, Inherited, Unknown), since some characters have multiple properties or wrong properties.

For example, U+3099 (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) has two properties ("Katakana" and "Hiragana"), and the PCRE functions classify it as "Inherited". Another example is U+2A708. Although the right script property of U+2A708 (a combination of U+30C8 KATAKANA LETTER TO and U+30E2 KATAKANA LETTER MO) is "Katakana", the Unicode specification misclassifies it as "Han".
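
To see how your own PHP build classifies such a character, you can test it against several script classes. This is a small sketch with a hypothetical helper name (matching_script_classes). The output is not shown because it may depend on the PCRE version bundled with PHP: newer PCRE2 releases may also take Script_Extensions into account.

```php
<?php
// Returns the subset of $classes whose \p{...} class matches the single character $char.
function matching_script_classes(string $char, array $classes): array
{
    $matches = [];
    foreach ($classes as $class) {
        if (preg_match('/^\p{' . $class . '}$/u', $char) === 1) {
            $matches[] = $class;
        }
    }
    return $matches;
}

// U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
$mark = "\u{3099}";
echo implode(', ', matching_script_classes($mark, ['Hiragana', 'Katakana', 'Inherited', 'Common'])), PHP_EOL;
```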

You may need to consider IDN homograph attacks. Google Chrome's IDN policy uses a blacklist of characters.
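
As an illustration of the homograph problem, the sketch below flags labels that mix Latin letters with Cyrillic or Greek letters, which is how many lookalike names are built. It is not Chrome's policy and not a complete defence; the function name is hypothetical.

```php
<?php
// Flags a label that contains Latin letters together with Cyrillic or Greek letters.
function mixes_latin_with_lookalike_script(string $label): bool
{
    $hasLatin = preg_match('/\p{Latin}/u', $label) === 1;
    $hasLookalike = preg_match('/[\p{Cyrillic}\p{Greek}]/u', $label) === 1;

    return $hasLatin && $hasLookalike;
}

// The second letter here is U+0430 CYRILLIC SMALL LETTER A, not Latin "a".
var_dump(mixes_latin_with_lookalike_script("p\u{0430}ypal")); // bool(true)
var_dump(mixes_latin_with_lookalike_script('paypal'));        // bool(false)
```

A whole-script spoof (a label written entirely in Cyrillic, for example) passes this check, which is one reason a character blacklist or a table of permitted code points is still needed.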

My recommendation is to use Zend\Validator\Hostname. This library uses tables of permitted code points for Japanese and Chinese.

If you use Symfony, consider upgrading the application to version 2.5, which adopts egulias/email-validator (Manual). You also need to validate that the string is a well-formed byte sequence. See my report for the details.
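
One way to check that a string is a well-formed UTF-8 byte sequence, using only the PCRE functions, is sketched below. The helper name is hypothetical; mb_check_encoding() is an alternative if the mbstring extension is available.

```php
<?php
// With the /u modifier, preg_match() fails (returns false) when the subject
// is not valid UTF-8, so an empty pattern is enough to test well-formedness.
function is_well_formed_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_well_formed_utf8('παράδειγμα.δοκιμή')); // bool(true)
var_dump(is_well_formed_utf8("\xC3\x28"));          // bool(false): truncated 2-byte sequence
```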

Don't forget XSS and SQL injection. The following address is a valid email address under RFC 5322.

// From Japanese tutorial
// http://blog.tokumaru.org/2013/11/xsssqlrfc5322.html
"><script>alert('or/**/1=1#')</script>"@example.jp

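Because an address like this passes syntax validation, it still has to be escaped wherever it is used. The sketch below shows output escaping for HTML with htmlspecialchars(); the helper name is hypothetical. For SQL, pass the value as a bound parameter of a prepared statement (for example with PDO) rather than concatenating it into the query.

```php
<?php
// The address from the example above, kept verbatim in a nowdoc.
$address = <<<'EOT'
"><script>alert('or/**/1=1#')</script>"@example.jp
EOT;

// Escape a value for use in HTML text or attribute context.
function escape_for_html(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo '<p>Contact: ', escape_for_html($address), '</p>', PHP_EOL;
```
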
I think using idn_to_ascii for validation is doubtful, since idn_to_ascii passes almost all characters.

for ($i = 0; $i < 0x110000; ++$i) {
    $c = utf8_chr($i);

    if ($c !== '' && false !== idn_to_ascii($c)) {
        $number = strtoupper(dechex($i));
        $length = strlen($number);

        if ($i < 0x10000) {
            $number = str_repeat('0', 4 - $length).$number;
        }

        $idn = $c.'example.com';

        echo 'U+'.$number.' ';
        echo ' '.$idn.' '. idn_to_ascii($idn);
        echo PHP_EOL;
    }
}

function utf8_chr($code_point) {

    if ($code_point < 0 || 0x10FFFF < $code_point || (0xD800 <= $code_point && $code_point <= 0xDFFF)) {
        return '';
    }

    if ($code_point < 0x80) {
        $hex[0] = $code_point;
        $ret = chr($hex[0]);
    } else if ($code_point < 0x800) {
        $hex[0] = 0xC0 | $code_point >> 6; // lead byte of a 2-byte sequence is 110xxxxx
        $hex[1] = 0x80  | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]);
    } else if ($code_point < 0x10000) {
        $hex[0] = 0xE0 | $code_point >> 12;
        $hex[1] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[2] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]);
    } else  {
        $hex[0] = 0xF0 | $code_point >> 18;
        $hex[1] = 0x80 | $code_point >> 12 & 0x3F;
        $hex[2] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[3] = 0x80 | $code_point  & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]).chr($hex[3]);
    }

    return $ret;
}

If you want to validate a domain by Unicode Script properties, use the PCRE functions.

The following code shows how to get the name of a character's Unicode Script property. If you want to check Unicode Script properties in JavaScript, use mathiasbynens/unicode-data.

function get_unicode_script_name($c) {

  // http://php.net/manual/regexp.reference.unicode.php
  $names = [
    'Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak', 'Bengali', 
    'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal',
    'Carian', 'Chakma', 'Cham', 'Cherokee', 'Common', 'Coptic', 'Cuneiform',
    'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Egyptian_Hieroglyphs',
    'Ethiopic', 'Georgian', 'Glagolitic', 'Gothic', 'Greek', 'Gujarati', 
    'Gurmukhi', 'Han', 'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic',
    'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian', 'Javanese',
    'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li', 'Kharoshthi', 'Khmer', 'Lao', 'Latin',
    'Lepcha', 'Limbu', 'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic',
    'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao', 'Mongolian',
    'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic', 'Old_Persian',
    'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki', 'Oriya', 'Osmanya', 'Phags_Pa',
    'Phoenician', 'Rejang', 'Runic', 'Samaritan', 'Saurashtra', 'Sharada', 'Shavian',
    'Sinhala', 'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog',
    'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil', 'Telugu', 'Thaana',
    'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic', 'Vai', 'Yi'
  ];

  foreach ($names as $name) {

    $pattern = '/\p{'.$name.'}/u';

    if (preg_match($pattern, $c)) {
        return $name;
    }
  }

  return '';
}
