Check if a JavaScript string is valid UTF-8
Problem description
A user can copy and paste into an HTML textarea input, and sometimes pastes invalid UTF-8 characters, for example when copying from an RTF file that contains tabs.
How can I check whether a string is valid UTF-8?
Solution
I think you misunderstand what "UTF-8 characters" means. UTF-8 is an encoding of Unicode, which can represent pretty much every character and glyph that has ever existed in recorded human history, so to that extent there are no "invalid" UTF-8 characters.
RTF is a formatting system which works independently of the underlying encoding system - you can use RTF with ASCII, UTF-8, UTF-16 and others. Textboxes in HTML only respect plain text, so any RTF formatting will be automatically stripped (unless you're using a "rich-edit" component, which I assume you're not).
What you describe, though, are whitespace characters (like tabs: \t), which can be represented in Unicode (and so in UTF-8). A string containing those characters is still "valid UTF-8"; it's just invalid as far as your business requirements are concerned.
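For completeness, there is one narrow case where a JavaScript string cannot be encoded as UTF-8 at all: JavaScript strings are sequences of UTF-16 code units, and an unpaired surrogate has no UTF-8 encoding. The following is only a sketch of such a check; the function name `hasLoneSurrogate` is made up for illustration and is not part of the original answer.

```javascript
// Hypothetical helper: returns true when the string contains an unpaired
// UTF-16 surrogate, which cannot be encoded as UTF-8.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code >= 0xD800 && code <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // valid pair, skip its second half
      } else {
        return true;
      }
    } else if (code >= 0xDC00 && code <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}
```

A string with tabs, such as `'a\tb'`, passes this check, which is the point of the answer: tabs are valid, just unwanted. Recent JavaScript engines also offer `String.prototype.isWellFormed()` for the same purpose, if the target browsers support it.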
I suggest simply stripping out the unwanted characters with a regular expression that matches non-visible characters (taken from the question "Match non printable/non ascii characters and remove from text"):
textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');
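The expression [^\x20-\x7E] matches any character NOT in the code point range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character); every matched character is removed. Note that this also removes line breaks and every non-ASCII character, such as accented letters.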
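To make the effect of the pattern concrete, here is a small sketch that compares it with a gentler variant. The function names and the second pattern are assumptions added for illustration, not part of the original answer: the variant removes only control characters (tab and line breaks included) and leaves other Unicode text alone.

```javascript
// The answer's pattern: keep only printable ASCII (0x20 to 0x7E).
function stripToPrintableAscii(text) {
  return text.replace(/[^\x20-\x7E]+/g, '');
}

// Assumed alternative: remove only C0/C1 control characters and DEL,
// so accented letters and other non-ASCII text survive.
function stripControlChars(text) {
  return text.replace(/[\u0000-\u001F\u007F-\u009F]+/g, '');
}

const pasted = 'Caf\u00E9\tmenu\r\n';
console.log(stripToPrintableAscii(pasted)); // "Cafmenu"
console.log(stripControlChars(pasted));     // "Cafémenu"
```

Which of the two fits depends on the business requirement: the first is appropriate only when the field must be plain ASCII, while the second is a possible starting point when users may legitimately type non-English text.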
Unicode's first 128 code points (0 to 127) are identical to ASCII and can be seen here: http://www.asciitable.com/