encodeURIComponent throws an exception


Question

I am building a URI programmatically from user-provided input with the help of the encodeURIComponent function. However, when the user enters invalid Unicode characters (such as U+DFFF), the function throws an exception with the following message:

The URI to be encoded contains an invalid character

I looked this up on MSDN, but that didn't tell me anything I didn't already know:

To correct this error

  • Ensure the string to be encoded contains only valid Unicode sequences.

My question is: is there a way to sanitize the user-provided input so that all invalid Unicode sequences are removed before I pass it to the encodeURIComponent function?

Recommended answer

Taking a programmatic approach to find the answer, the only range that turned up any problems was \ud800-\udfff, the range of the high and low surrogates:

for (var regex = '/[', firstI = null, lastI = null, i = 0; i <= 65535; i++) {
    try {
        encodeURIComponent(String.fromCharCode(i));
    }
    catch(e) {
        if (firstI !== null) {
            if (i === lastI + 1) {
                lastI++;
            }
            else if (firstI === lastI) {
                regex += '\\u' + firstI.toString(16);
                firstI = lastI = i; 
            }
            else {
                regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
                firstI = lastI = i; 
            }
        }
        else {
            firstI = i;
            lastI = i;
        }        
    }
}

if (firstI === lastI) {
    regex += '\\u' + firstI.toString(16);
}
else {
    regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
}
regex += ']/';
alert(regex);  // /[\ud800-\udfff]/
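
The same scan can be written more compactly. The following is only a sketch under a few assumptions: findThrowingRanges and toEscape are illustrative names that do not appear in the original answer, console.log stands in for alert so the code also runs outside a browser, and the hex digits are padded to four places so that a low code unit would still produce a valid \uXXXX escape.

```javascript
// Scan every UTF-16 code unit and collect the inclusive ranges for which
// encodeURIComponent throws. Both function names are illustrative only.
function findThrowingRanges() {
    var ranges = [];
    var start = null; // first code unit of the current failing run, if any
    for (var i = 0; i <= 0xFFFF; i++) {
        var throws = false;
        try {
            encodeURIComponent(String.fromCharCode(i));
        } catch (e) {
            throws = true;
        }
        if (throws && start === null) {
            start = i; // a failing run begins here
        } else if (!throws && start !== null) {
            ranges.push([start, i - 1]); // the failing run ended at i - 1
            start = null;
        }
    }
    if (start !== null) {
        ranges.push([start, 0xFFFF]); // run extends to the last code unit
    }
    return ranges;
}

// Format a code unit as a four-digit \uXXXX escape.
function toEscape(n) {
    return '\\u' + ('0000' + n.toString(16)).slice(-4);
}

var source = findThrowingRanges().map(function (r) {
    return r[0] === r[1] ? toEscape(r[0]) : toEscape(r[0]) + '-' + toEscape(r[1]);
}).join('');
console.log('/[' + source + ']/'); // /[\ud800-\udfff]/
```

Unlike the loop above, this version also copes with the case where nothing throws: the list of ranges is simply empty.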

I then confirmed this with a simpler example:

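for (var i = 0; i <= 65535; i++) {
    // Skip the surrogate range. Testing it in the loop condition instead
    // would end the loop at 0xD800 and leave 0xE000-0xFFFF unchecked.
    if (i >= 0xD800 && i <= 0xDFFF) {
        continue;
    }
    try {
        encodeURIComponent(String.fromCharCode(i));
    }
    catch(e) {
        alert(e); // Doesn't alert
    }
}
alert('ok!');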

This fits with what MSDN says: apart from the surrogates, all of those Unicode characters (even the valid Unicode "non-characters") are valid Unicode sequences.

You can indeed filter out high and low surrogates, but when used in a high-low pair they become legitimate (they are meant to be used this way, which lets Unicode expand drastically beyond its original maximum number of characters):

alert(encodeURIComponent('\uD800\uDC00')); // ok
alert(encodeURIComponent('\uD800')); // not ok
alert(encodeURIComponent('\uDC00')); // not ok either
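
To run the three cases above without the script stopping at the first exception, the call can be wrapped in a try/catch. This is a minimal sketch: tryEncodeURIComponent is a hypothetical helper name, not something from the original answer, and it relies on the fact that encodeURIComponent reports an unpaired surrogate by throwing a URIError.

```javascript
// Hypothetical helper: returns the encoded string, or null when the input
// contains an unpaired surrogate and encodeURIComponent throws a URIError.
function tryEncodeURIComponent(str) {
    try {
        return encodeURIComponent(str);
    } catch (e) {
        if (e instanceof URIError) {
            return null;
        }
        throw e; // any other error is unexpected, so let it propagate
    }
}

console.log(tryEncodeURIComponent('\uD800\uDC00')); // "%F0%90%80%80" (U+10000 as UTF-8)
console.log(tryEncodeURIComponent('\uD800'));       // null (lone high surrogate)
console.log(tryEncodeURIComponent('\uDC00'));       // null (lone low surrogate)
```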

So, if you want to take the easy route and block all surrogates, it is just a matter of:

urlPart = urlPart.replace(/[\ud800-\udfff]/g, '');
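
The cost of the easy route is that it also deletes valid surrogate pairs, so any character outside the Basic Multilingual Plane disappears from the input. A small illustration, using an emoji (U+1F600) as an assumed example input:

```javascript
var withEmoji = 'a\uD83D\uDE00b'; // 'a', U+1F600 written as a surrogate pair, 'b'

// The blanket replacement removes both halves of the valid pair.
var stripped = withEmoji.replace(/[\ud800-\udfff]/g, '');
console.log(stripped); // "ab"

// The original string was already safe to encode, because the pair is valid.
console.log(encodeURIComponent(withEmoji)); // "a%F0%9F%98%80b"
```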

If you want to strip out unmatched (invalid) surrogates while still allowing surrogate pairs (which are legitimate sequences, although the characters they represent are rarely needed), you can do the following:

function stripUnmatchedSurrogates (str) {
    return str.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, '').split('').reverse().join('').replace(/[\uDC00-\uDFFF](?![\uD800-\uDBFF])/g, '').split('').reverse().join('');
}

var urlPart = '\uD801 \uD801\uDC00 \uDC01';
alert(stripUnmatchedSurrogates(urlPart)); // Leaves one valid sequence (representing a single non-BMP character)

If JavaScript had negative lookbehind, the function would be a lot less ugly...
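
Lookbehind assertions were later added to the language in ECMAScript 2018. Assuming an engine that implements them, the double reversal can be replaced by a single pass. The function name below is illustrative, and the sketch is meant to match the behavior of stripUnmatchedSurrogates above rather than to replace it in older environments, where the regular expression would be a syntax error.

```javascript
// Remove a high surrogate that is not followed by a low surrogate, or a
// low surrogate that is not preceded by a high surrogate. Without the "u"
// flag the pattern works on individual UTF-16 code units.
// Requires lookbehind support (ECMAScript 2018).
function stripUnmatchedSurrogatesLookbehind(str) {
    return str.replace(
        /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
        ''
    );
}

var sample = '\uD801 \uD801\uDC00 \uDC01';
var cleaned = stripUnmatchedSurrogatesLookbehind(sample);
console.log(cleaned === ' \uD801\uDC00 '); // true: only the valid pair remains
console.log(encodeURIComponent(cleaned));  // no longer throws
```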
