Bare-minimum text sanitation

Views: 74 · Published: 2020/4/27 4:04:58 · Tags: string, language-agnostic, text, sanitization

Question

In an application that accepts, stores, processes, and displays Unicode text (for the purpose of discussion, let's say that it's a web application), which characters should always be removed from incoming text?

I can think of some, mostly listed in the C0 and C1 control codes Wikipedia article:

The range 0x00-0x1F (the C0 control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)

The range 0x7F-0x9F (more control characters)

Ranges of characters that can safely be accepted would be even better to know.

There are other levels of text filtering — one might canonicalize characters that have multiple representations, replace nonbreaking characters, and remove zero-width characters — but I'm mainly interested in the basics.

Recommended answer

See the W3C note "Unicode in XML and Other Markup Languages". It defines a class of characters as ‘discouraged for use in markup’, which I'd definitely filter out for most web sites. It notably includes such characters as:

U+2028–9 which are funky newlines that will confuse JavaScript if you try to use them in a string literal;

U+202A–E which are bidi control codes that wily users can insert to make text appear to run backwards in some browsers, even outside of a given HTML element;

language override control codes that could also have scope outside of an element;

BOM.
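
As a rough illustration of the list above, here is a minimal Python sketch that strips those characters. The helper name `strip_discouraged` is hypothetical. The code points are one reading of the list: U+2028 to U+2029, U+202A to U+202E, and the BOM (U+FEFF). Treating "language override control codes" as the tag characters U+E0000 to U+E007F is an assumption, so check it against the note before relying on it.

```python
import re

# Code points taken from the list above. The tag-character range
# (U+E0000-U+E007F) is an ASSUMPTION about what "language override
# control codes" refers to; adjust it to match the W3C note if needed.
_DISCOURAGED = re.compile(
    "["
    "\u2028\u2029"           # line separator, paragraph separator
    "\u202A-\u202E"          # bidi embedding/override controls
    "\uFEFF"                 # byte order mark
    "\U000E0000-\U000E007F"  # tag characters (assumed)
    "]"
)


def strip_discouraged(text: str) -> str:
    """Remove characters discouraged for use in markup (hypothetical helper)."""
    return _DISCOURAGED.sub("", text)
```

One caveat on the assumed range: tag characters are also used in some emoji flag sequences, so a site that wants to display those may prefer to leave that range alone.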

Additionally, you'd want to filter or replace the characters that are not valid in Unicode at all (U+FFFF et al.), and, if you are using a language that works in UTF-16 natively (e.g. Java, Python on Windows), any surrogate characters (U+D800–U+DFFF) that do not form valid surrogate pairs.
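
A possible way to do this in Python 3 is sketched below; the function names are hypothetical. It treats "U+FFFF et al" as the Unicode noncharacters (U+FDD0 to U+FDEF, plus the last two code points of every plane). Because a Python 3 `str` is a sequence of code points rather than UTF-16 units, any surrogate found in one is necessarily unpaired, so the sketch replaces all of them. In a UTF-16 language you would instead need to check whether each surrogate is part of a valid pair.

```python
def _is_invalid(cp: int) -> bool:
    """True for surrogate code points and Unicode noncharacters."""
    if 0xD800 <= cp <= 0xDFFF:        # surrogates (always lone in a Python 3 str)
        return True
    if 0xFDD0 <= cp <= 0xFDEF:        # noncharacter block in the BMP
        return True
    return (cp & 0xFFFE) == 0xFFFE    # U+xFFFE and U+xFFFF in every plane


def replace_invalid(text: str, replacement: str = "\uFFFD") -> str:
    """Replace invalid code points, by default with U+FFFD (hypothetical helper)."""
    return "".join(replacement if _is_invalid(ord(ch)) else ch for ch in text)
```

Passing `replacement=""` removes the characters instead of substituting U+FFFD.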

The range 0x00-0x1F (the C0 control characters), excluding 0x09 (tab), 0x0A (LF), and 0x0D (CR)

And arguably (especially for a web application), lose CR as well, and turn tabs into spaces.
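
A minimal sketch of that stricter variant in Python, assuming the intended range is the whole C0 block (0x00 through 0x1F). The name `clean_c0` is hypothetical. It keeps LF, turns each tab into a single space, and drops every other C0 control, CR included.

```python
# Map every C0 control (0x00-0x1F) to None, which str.translate deletes.
_C0_TABLE = {cp: None for cp in range(0x00, 0x20)}
_C0_TABLE[0x09] = " "   # tab becomes a single space
del _C0_TABLE[0x0A]     # LF is kept as the only line break


def clean_c0(text: str) -> str:
    """Drop C0 controls except LF; convert tabs to spaces (hypothetical helper)."""
    return text.translate(_C0_TABLE)
```

Note that dropping CR outright turns CRLF into LF but loses line breaks in text that uses a bare CR as its line ending. If that matters, normalize line endings before calling this.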

The range 0x7F-0x9F (more control characters)

Yep, away with those, except in cases where people might really mean them. (SO used to allow them, which let people post strings that had been mis-decoded, which was occasionally useful for diagnosing Unicode problems.) For most sites I think you'd not want them.
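
For completeness, removing DEL and the C1 controls could look like the following Python sketch (the name `strip_del_and_c1` is hypothetical):

```python
import re

# U+007F (DEL) through U+009F (end of the C1 control block).
_DEL_AND_C1 = re.compile("[\x7F-\x9F]")


def strip_del_and_c1(text: str) -> str:
    """Remove DEL and C1 control characters (hypothetical helper)."""
    return _DEL_AND_C1.sub("", text)
```

A site that wants to keep mis-decoded strings visible for debugging, as described above, would skip this step or apply it only at display time.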
