Check if a JavaScript string is valid UTF-8


Problem description


A user can copy and paste text into an HTML textarea, and sometimes pastes invalid UTF-8 characters, for example when copying from an RTF file that contains tabs.

How can I check whether a string is valid UTF-8?

Solution

I think you misunderstand what "UTF-8 characters" means. UTF-8 is an encoding of Unicode, which can represent pretty much every single character and glyph that has ever existed in recorded human history, so to that extent there are no "invalid" UTF-8 characters.

RTF is a formatting system which works independently of the underlying encoding system - you can use RTF with ASCII, UTF-8, UTF-16 and others. Textboxes in HTML only respect plain text, so any RTF formatting will be automatically stripped (unless you're using a "rich-edit" component, which I assume you're not).

But the things you describe, such as whitespace characters (like tabs: \t), are representable in Unicode (and so in UTF-8). A string containing those characters is still "valid UTF-8"; it's just invalid as far as your business requirements are concerned.
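One narrow case is worth separating from the business-rule question above. A JavaScript string is a sequence of UTF-16 code units, and a surrogate code unit without its partner has no well-formed UTF-8 encoding. The sketch below checks for that case; `hasLoneSurrogate` is a hypothetical helper name, not something from the original answer, and it says nothing about tabs or other unwanted-but-valid characters.

```javascript
// Hypothetical helper: returns true if `text` contains a surrogate
// code unit that is not part of a valid high/low surrogate pair.
function hasLoneSurrogate(text) {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next unit must be a low surrogate.
      // charCodeAt returns NaN past the end, which fails the range test.
      const next = text.charCodeAt(i + 1);
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return true;
      i++; // skip the low surrogate we just validated
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

console.log(hasLoneSurrogate('tab\there'));    // false: a tab is a valid character
console.log(hasLoneSurrogate('\uD83D\uDE00')); // false: a complete surrogate pair
console.log(hasLoneSurrogate('abc\uD83D'));    // true: high surrogate at end of string
```

Newer engines also provide the built-in `String.prototype.isWellFormed()` for the same check; the loop above is only a fallback sketch for environments that lack it.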

I suggest just stripping out unwanted characters using a regular expression that matches non-visible characters (from here: "Match non printable/non ascii characters and remove from text"):

textBoxContent = textBoxContent.replace(/[^\x20-\x7E]+/g, '');

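The expression [^\x20-\x7E] matches any character NOT in the code point range 0x20 (32, a normal space character ' ') to 0x7E (126, the tilde '~' character); every matched character is removed. Note that this also removes line breaks and all non-ASCII characters, such as accented letters.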
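To make the effect of that expression concrete, here is a small sketch. The function names and the sample input are made up for illustration. The second function is a variant that assumes you only want to drop control characters (such as tabs) while keeping line breaks and non-ASCII letters; whether that matches your requirements is for you to decide.

```javascript
// The expression from the answer: keep only U+0020..U+007E.
function stripNonPrintableAscii(text) {
  return text.replace(/[^\x20-\x7E]+/g, '');
}

// Assumed variant: remove C0 control characters and DEL,
// but keep \n (0x0A), \r (0x0D), and everything above 0x7F.
function stripControlChars(text) {
  return text.replace(/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F]+/g, '');
}

// Hypothetical pasted input: a tab, an accented letter, and a line break.
const pasted = 'Name:\tJos\u00e9\nRole: admin~';

console.log(JSON.stringify(stripNonPrintableAscii(pasted)));
// "Name:JosRole: admin~"
console.log(JSON.stringify(stripControlChars(pasted)));
// "Name:José\nRole: admin~"
```

The first version is the safest if the field must be plain ASCII; the second is gentler on international text and multi-line input.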

Unicode's first 128 code points (0 to 127) are identical to ASCII and can be seen here: http://www.asciitable.com/
