How to remove invalid UTF-8 characters from a JavaScript string?


Question

I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried this:

strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");


The UTF-8 validation regex described here (link removed) seems more complete, so I adapted it in the same way:

strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");


Both of these pieces of code seem to let valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters in my test data (the "UTF-8 decoder capability and stress test" file). The bad characters either come through unchanged or seem to have some of their bytes removed, creating a new invalid character.

I'm not very familiar with the UTF-8 standard or with multibyte handling in JavaScript, so I'm not sure whether I'm failing to represent proper UTF-8 in the regex or applying that regex improperly in JavaScript.

I added the global flag to my regex per Tomalak's comment; however, this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.
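One possible explanation for the behaviour described above, offered as an assumption rather than a confirmed diagnosis: a JavaScript string is a sequence of UTF-16 code units, not UTF-8 bytes, so a range such as `[\xE0-\xEF]` matches characters like é (U+00E9) rather than UTF-8 lead bytes. The small sketch below applies the first pattern from the question to ordinary, valid text and compares it with the bytes produced by the standard `TextEncoder` API; the variable names are illustrative only.

```javascript
// The byte-oriented pattern from the question (first version).
const bytePattern = /([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g;

// In a JavaScript string, "\u00E9" is the single character é, not a UTF-8 byte.
const text = "caf\u00E9";

// é looks like a 3-byte lead "byte" with no continuation "bytes",
// so it falls through to `.` and is replaced by the empty group $1.
const filtered = text.replace(bytePattern, "$1");

// The actual UTF-8 encoding of é is the two bytes 0xC3 0xA9.
const utf8Bytes = Array.from(new TextEncoder().encode("\u00E9"));

console.log(filtered);   // "caf"
console.log(utf8Bytes);  // [ 195, 169 ]
```

If this is what is happening, the pattern is removing valid characters and keeping whatever happens to resemble a byte sequence, which would match the symptom of characters losing "bytes".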

Recommended answer

I use this simple and sturdy approach:

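// Keep only ASCII characters (code units 0-127); drop everything else.
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // charCodeAt returns the UTF-16 code unit at index i.
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}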

Basically, all you really want are the ASCII characters 0-127, so just rebuild the string character by character. If it's a good character, keep it; if not, ditch it. It's pretty robust, and if sanitization is your goal, it's fast enough (in fact, it's really fast).
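Note that the ASCII-only filter also discards every valid non-ASCII character (accented letters, CJK text, emoji). If that is too aggressive, a gentler variant might look like the sketch below. It is not part of the original answer: `stripMalformed` is a hypothetical helper, and it assumes the string has already been decoded, so the only malformed content left is unpaired UTF-16 surrogates and the U+FFFD replacement character that decoders typically substitute for invalid byte sequences.

```javascript
// Hypothetical helper (not from the original answer).
// Keeps valid text, including non-ASCII, and removes:
//   - unpaired surrogate code units (malformed UTF-16), and
//   - U+FFFD, the replacement character left behind by lenient decoders.
function stripMalformed(input) {
    // Group 1 captures a well-formed surrogate pair and is kept via "$1".
    // Any other surrogate, or U+FFFD, matches the second alternative
    // and is replaced by the empty group.
    return input.replace(
        /([\uD800-\uDBFF][\uDC00-\uDFFF])|[\uD800-\uDFFF\uFFFD]/g,
        "$1"
    );
}

console.log(stripMalformed("caf\u00E9 \uD83D\uDE00")); // unchanged: valid text
console.log(stripMalformed("a\uD83Db"));               // "ab": lone high surrogate removed
console.log(stripMalformed("x\uFFFDy"));               // "xy": replacement character removed
```

This reuses the "capture what is valid, drop the rest" trick from the question, but applies it to UTF-16 code units, which is what a JavaScript regex without the `u` flag actually sees.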
