How to remove invalid UTF-8 characters from a JavaScript string?
Question
I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried with this JavaScript:
strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");
The UTF-8 validation regex described here (link removed) seems more complete, so I adapted it in the same way:
strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");
Both of these pieces of code seem to let valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters in my test data: UTF-8 decoder capability and stress test. The bad characters either come through unchanged or seem to have some of their bytes removed, creating a new invalid character.
I'm not very familiar with the UTF-8 standard or with multibyte handling in JavaScript, so I'm not sure whether I'm failing to represent proper UTF-8 in the regex or applying the regex improperly in JavaScript.
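One possible reason byte-oriented patterns behave unexpectedly is sketched below. It assumes the string was already decoded before reaching the script, which the question does not confirm. JavaScript strings hold UTF-16 code units, so a decoded non-ASCII character is not stored as a lead byte followed by continuation bytes. The variable names are illustrative only.

```javascript
// Sketch: how one decoded character looks inside a JavaScript string.
const ch = "\u00E9"; // U+00E9, LATIN SMALL LETTER E WITH ACUTE

// Encoded as UTF-8, the character is the two bytes 0xC3 0xA9 ...
const utf8Bytes = Array.from(new TextEncoder().encode(ch));

// ... but inside the string it is a single UTF-16 code unit, 0xE9.
const codeUnits = Array.from(ch, (c) => c.charCodeAt(0));

// A byte-oriented pattern (lead byte, then continuation byte)
// therefore does not match the decoded character.
const twoBytePattern = /[\xC2-\xDF][\x80-\xBF]/;
const matchesDecoded = twoBytePattern.test(ch);

// It matches only when each UTF-8 byte is stored as its own character.
const byteString = String.fromCharCode(...utf8Bytes);
const matchesByteString = twoBytePattern.test(byteString);

console.log(utf8Bytes, codeUnits, matchesDecoded, matchesByteString);
```

If the data really is a "byte string" (every char code at most 0xFF), the patterns above can match as intended; if it was decoded first, they are being applied to code units rather than bytes.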
Edit: I added the global flag to my regex per Tomalak's comment, but it still isn't working for me. Per bobince's comment, I'm abandoning the attempt to do this on the client side.
Recommended answer
I use this simple and sturdy approach:
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // Keep only code units in the ASCII range (0-127).
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}
Basically all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it's a good char, keep it; if not, ditch it. It's pretty robust, and if sanitization is your goal, it's fast enough (in fact it's really fast).
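As a rough usage sketch, here is the same ASCII-only filter next to a regex version that should behave identically. `cleanStringRegex` and `sample` are hypothetical names introduced here, not part of the answer. Keep in mind that the filter removes every non-ASCII character, including valid ones such as ï, not only malformed input.

```javascript
// The answer's function, repeated so this sketch runs on its own.
function cleanString(input) {
  var output = "";
  for (var i = 0; i < input.length; i++) {
    if (input.charCodeAt(i) <= 127) {
      output += input.charAt(i);
    }
  }
  return output;
}

// Hypothetical one-line equivalent: delete anything outside 0x00-0x7F.
function cleanStringRegex(input) {
  return input.replace(/[^\x00-\x7F]/g, "");
}

// "naïve" followed by U+FFFD (the replacement character) and plain text.
const sample = "na\u00EFve\uFFFD text";

console.log(cleanString(sample));      // nave text
console.log(cleanStringRegex(sample)); // nave text
```

The loop and the regex are interchangeable here; the regex is shorter, while the loop makes the keep-or-drop rule easier to adjust.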