Bare-minimum text sanitation

Question

In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?

I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:

  1. The range 0x00-0x1F (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)

  2. The range 0x7F-0x9F (more control characters)

Ranges of characters that can safely be accepted would be even better to know.

There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.

Recommended answer

See the W3C note Unicode in XML and Other Markup Languages. It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:

  • U+2028–9, which are funky newlines that will confuse JavaScript if you try to use them in a string literal;

  • U+202A–E, which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;

  • language override control codes that could also have scope outside of an element;

  • the BOM (U+FEFF).
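
As a rough illustration of filtering the characters listed above, here is a minimal Python sketch. The function name `strip_discouraged` is hypothetical, the code point set is only a subset of what the W3C note covers, and it assumes that "language override control codes" refers to the language tag characters in U+E0000–U+E007F.

```python
import re

# Assumed subset of the characters discussed above, not the full W3C list.
_DISCOURAGED = re.compile(
    "["
    "\u2028\u2029"           # line separator and paragraph separator
    "\u202A-\u202E"          # bidi embedding and override controls
    "\uFEFF"                 # byte order mark (zero width no-break space)
    "\U000E0000-\U000E007F"  # language tag characters (assumed meaning)
    "]"
)


def strip_discouraged(text: str) -> str:
    """Return text with the discouraged characters removed."""
    return _DISCOURAGED.sub("", text)
```

Whether to delete these characters outright or replace them with a visible placeholder is a policy choice; deletion is used here only because it is the simplest option.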

Additionally, you'd want to filter or replace the characters that are not valid in Unicode at all (U+FFFF et al.), and, if you are using a language that works in UTF-16 natively (e.g. Java, Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.
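
A possible way to handle these in Python 3 is sketched below. The helper names are hypothetical. It assumes a Python 3.3+ `str`, which is a sequence of code points, so any surrogate found in the string is by definition unpaired. It also treats "U+FFFF et al." as the 66 Unicode noncharacters (U+FDD0–U+FDEF plus the last two code points of every plane).

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points Unicode reserves as noncharacters."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def replace_invalid(text: str, replacement: str = "\uFFFD") -> str:
    """Replace noncharacters and lone surrogates with a replacement string."""
    out = []
    for ch in text:
        cp = ord(ch)
        # In a Python 3 str every surrogate code point is an unpaired one.
        if 0xD800 <= cp <= 0xDFFF or is_noncharacter(cp):
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)
```

Replacing with U+FFFD rather than deleting keeps a visible trace that something was removed; passing `replacement=""` would drop the characters instead.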

The range 0x00-0x1F (mostly control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)

And arguably (especially for a web application), lose CR as well, and turn tabs into spaces.
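
One way this could look in Python is sketched below. The name `clean_c0` and the `tab_width` parameter are hypothetical. Converting CRLF and lone CR to LF (rather than simply deleting CR) is an assumption made here so that old Mac-style line endings do not collapse lines together.

```python
import re

# C0 controls except tab (0x09), LF (0x0A) and CR (0x0D).
_C0_CONTROLS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def clean_c0(text: str, tab_width: int = 4) -> str:
    """Strip C0 controls, normalize line endings to LF, expand tabs."""
    text = _C0_CONTROLS.sub("", text)
    # Assumption: treat CRLF and lone CR as line breaks instead of dropping CR.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Each tab becomes a fixed number of spaces (no column alignment).
    return text.replace("\t", " " * tab_width)
```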

The range 0x7F-0x9F (more control characters)

Yep, away with those, except in cases where people might really mean them. (SO used to allow them, which allowed people to post strings that had been mis-decoded, which was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.
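
A small Python sketch of that filter follows, with an opt-out for sites that do want to preserve these characters. The function name `strip_c1` and the `keep` flag are hypothetical.

```python
import re

# DEL (0x7F) and the C1 control range (0x80-0x9F).
_DEL_AND_C1 = re.compile(r"[\x7F-\x9F]")


def strip_c1(text: str, keep: bool = False) -> str:
    """Remove DEL and C1 controls unless the caller asks to keep them."""
    return text if keep else _DEL_AND_C1.sub("", text)
```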
