How to remove invalid UTF-8 characters from a JavaScript string?


Question

I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried this:

strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");

The UTF-8 validation regex described here (link removed) seems more complete, so I adapted it in the same way:

strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");

Both of these pieces of code seem to let valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters from my test data: UTF-8 decoder capability and stress test. Either the bad characters come through unchanged, or some of their bytes are removed, creating a new invalid character.
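The byte ranges in the regexes above only make sense if every character of the string holds exactly one raw byte. Under that assumption (the question does not say how the string was obtained), a possible alternative is to let the platform's UTF-8 decoder find the malformed sequences. The sketch below is not from the original post: `stripInvalidUtf8` is a hypothetical helper, and it relies on the standard `TextDecoder` API, which in its default non-fatal mode replaces each malformed sequence with U+FFFD.

```javascript
// Hypothetical helper. Assumes `binaryString` is a "binary string":
// each character's code (0-255) stands for one raw byte of the input.
function stripInvalidUtf8(binaryString) {
    var bytes = new Uint8Array(binaryString.length);
    for (var i = 0; i < binaryString.length; i++) {
        bytes[i] = binaryString.charCodeAt(i) & 0xFF;
    }
    // Non-fatal decoding turns every malformed sequence into U+FFFD.
    var decoded = new TextDecoder("utf-8").decode(bytes);
    // Drop the replacement characters. Caveat: a U+FFFD that was validly
    // encoded in the input is removed as well.
    return decoded.replace(/\uFFFD/g, "");
}

// Usage: "\xC3\xA9" is the valid encoding of "é"; "\xFF" is never valid.
var cleaned = stripInvalidUtf8("a\xC3\xA9b\xFFc");
console.log(cleaned === "a\u00E9bc");
```

By default `TextDecoder` also strips a leading byte order mark; passing `{ ignoreBOM: true }` as the second constructor argument keeps it.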

I'm not very familiar with the UTF-8 standard or with multibyte handling in JavaScript, so I'm not sure whether I'm failing to represent proper UTF-8 in the regex or applying that regex improperly in JavaScript.
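Some background that may help here: a JavaScript string is a sequence of UTF-16 code units, not UTF-8 bytes, so an already-decoded string contains no UTF-8 byte sequences for a byte-range regex to match. The only ill-formed content such a string can hold is an unpaired surrogate. The following is a minimal sketch of removing those; `removeLoneSurrogates` is a hypothetical name, not part of the original post.

```javascript
// Hypothetical helper, not from the original post.
function removeLoneSurrogates(input) {
    // With the "u" flag the regex walks the string by code point, so a valid
    // surrogate pair counts as one character above U+FFFF and is not matched.
    // Only unpaired surrogates fall inside this range.
    return input.replace(/[\uD800-\uDFFF]/gu, "");
}

// "é" is one UTF-16 code unit in a JavaScript string, not two UTF-8 bytes.
console.log("\u00E9".length === 1);
// The lone high surrogate is dropped; the other characters are kept.
console.log(removeLoneSurrogates("a\uD800b") === "ab");
```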

Edit: I added the global flag to my regex per Tomalak's comment, but this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.

Recommended answer

I use this simple and sturdy approach:

function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // Keep only code units in the ASCII range (0-127).
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}

Basically, all you really want are the ASCII characters 0-127, so just rebuild the string character by character. If it's a good character, keep it; if not, ditch it. It's pretty robust, and if sanitization is your goal, it's fast enough (in fact, it's really fast). Note that this discards every non-ASCII character, including valid ones such as accented letters.
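The same ASCII-only filter can also be written as a single regex replacement. This is a sketch of an equivalent form, not the answer author's code; `cleanStringRegex` is a hypothetical name.

```javascript
// Hypothetical one-line variant of the ASCII-only filter.
function cleanStringRegex(input) {
    // Remove every UTF-16 code unit outside the ASCII range 0-127.
    return input.replace(/[^\x00-\x7F]/g, "");
}

// Usage: the accented letter and the emoji are dropped, ASCII is kept.
console.log(cleanStringRegex("h\u00E9llo\uD83D\uDE00!") === "hllo!");
```

Without the `u` flag the regex works on individual code units, so both halves of a surrogate pair are removed, which matches the behaviour of the loop that tests `charCodeAt(i) <= 127`.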
