How to protect against... diacritics? (aka Unicode combining characters / Zalgo text)


Problem description




The character pictured above was tweeted a few months ago by Mikko Hyppönen (http://www.ted.com/speakers/mikko_hypponen.html), a computer security expert known for his work on computer viruses and his TED talks on computer security. Out of respect for SO, I will only post an image of it, but you get the idea. It's obviously not something you'd want spreading around your website and freaking out visitors.

Upon further inspection, the character appears to be a letter of the Thai alphabet combined with over 87 diacritics (is there even a limit?!). This got me thinking about security, localization, and how one might handle this sort of input. My searching led me to a question on Stack Overflow, and in turn to a blog post from Michael Kaplan on stripping diacritics. In it, he demonstrates how one can decompose a string and keep only its "base" characters (simplified here for the sake of brevity):

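// Needs: using System.Globalization; using System.Text;
StringBuilder sb = new StringBuilder();
// FormD splits each precomposed character into a base character plus combining marks.
foreach (char c in "façade".Normalize(NormalizationForm.FormD))
{
    // Keep everything except non-spacing (combining) marks.
    if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        sb.Append(c);
}
Response.Write(sb.ToString()); // facade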

I can see how this would be useful in some cases, but in terms of user input, it would strip out ALL diacritics. As Kaplan points out, removing the diacritics in some languages can completely change the meaning of a word. This raises the question: how does one permit some diacritics in user input/output, but exclude extreme cases such as Mikko Hyppönen's über character?

Solution

"is there even a limit?!"

Not intrinsically in Unicode. UAX #15 defines a 'Stream-Safe' format that sets a limit of 30 combiners in a row. Unicode strings in general are not guaranteed to be Stream-Safe, but this could certainly be taken as a sign that the Unicode Consortium does not intend to standardise new characters that would require a grapheme cluster longer than that.

30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hakṣhmalawarayaṁ, at 1 base character plus 8 combiners, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.

If you only care about common Western European languages, you can probably bring that down to 2. A sensible compromise probably lies somewhere between those two limits.
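
As a rough illustration of the approach described above, here is a minimal C# sketch that normalises to NFD and drops any combining mark beyond a configurable number in a row. The class name `CombiningMarkLimiter`, the method name `LimitCombiningMarks`, and the `maxMarks` parameter are hypothetical names chosen for this example. Three choices here are assumptions rather than part of the answer: treating all three Unicode mark categories (Mn, Mc, Me) as "combiners", silently dropping the excess marks instead of rejecting the input, and recomposing the result to NFC.

```csharp
using System;
using System.Globalization;
using System.Text;

static class CombiningMarkLimiter
{
    // Assumption: a "combiner" is any character in the Mn, Mc or Me categories.
    static bool IsCombiningMark(char c)
    {
        UnicodeCategory cat = char.GetUnicodeCategory(c);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    // Normalises to NFD, then keeps at most maxMarks consecutive combining
    // marks after each base character and drops the rest.
    public static string LimitCombiningMarks(string input, int maxMarks = 8)
    {
        var sb = new StringBuilder(input.Length);
        int run = 0; // length of the current run of combining marks

        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            if (IsCombiningMark(c))
            {
                run++;
                if (run > maxMarks)
                    continue; // over the limit: skip this mark
            }
            else
            {
                run = 0; // any other character ends the run
            }
            sb.Append(c);
        }

        // Assumption: recompose so ordinary accented text comes back in NFC.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Calling `CombiningMarkLimiter.LimitCombiningMarks(userInput)` leaves ordinary text such as "façade" intact, while a base letter followed by dozens of stacked marks keeps only the first eight; passing `2` as the second argument gives the stricter Western European limit. Because the loop works on UTF-16 `char` values, combining marks outside the Basic Multilingual Plane arrive as surrogate pairs and are not counted, so this sketch would need extending to cover them.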
