How to protect against diacritics such as Zalgo text

Problem description



The character pictured above was tweeted a few months ago by Mikko Hyppönen, a computer security expert known for his work on computer viruses and his TED talks on computer security. Out of respect for SO, I will only post an image of it, but you get the idea. It's obviously not something you'd want spreading around your website and freaking out visitors.

Upon further inspection, the character appears to be a letter of the Thai alphabet combined with over 87 diacritics (is there even a limit?!). This got me thinking about security, localization, and how one might handle this sort of input. My searching led me to this question on Stack Overflow, and in turn to a blog post from Michael Kaplan on stripping diacritics. In it, he demonstrates how one can decompose a string into its "base" characters (simplified here for the sake of brevity):

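// Requires: using System.Globalization; using System.Text;
StringBuilder sb = new StringBuilder();
// FormD splits each precomposed character into a base character plus combining marks.
foreach (char c in "façade".Normalize(NormalizationForm.FormD))
{
    // Keep everything except non-spacing (combining) marks.
    if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        sb.Append(c);
}
Response.Write(sb.ToString()); // facade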

I can see how this would be useful in some cases, but in terms of user input, it would strip out ALL diacritics. As Kaplan points out, removing the diacritics in some languages can completely change the meaning of a word. This raises the question: how does one permit some diacritics in user input/output, but exclude extreme cases such as Mikko Hyppönen's über character?

Solution

is there even a limit?!

Not intrinsically in Unicode. There is the concept of a 'Stream-Safe' format in UAX #15 that sets a limit of 30 combiners... Unicode strings in general are not guaranteed to be Stream-Safe, but this could certainly be taken as a sign that the Unicode Consortium doesn't intend to standardise new characters that would require a grapheme cluster longer than that.

30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hakṣhmalawarayaṁ at 1 base plus 8 combiners, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.

If you only care about common Western European languages, you can probably bring that down to 2. So a sensible limit is potentially a compromise somewhere between those figures.
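As a rough illustration of the approach described above, here is a minimal sketch in Python using only the standard `unicodedata` module. The function names (`is_combining`, `longest_combining_run`, `is_acceptable`, `clamp_combiners`) are hypothetical, and treating every character in a Unicode mark category (`Mn`, `Mc`, `Me`) as a "combiner" is an assumption. The same idea should carry over to the .NET normalization and Unicode-category calls used in the question.

```python
import unicodedata


def is_combining(ch: str) -> bool:
    # Assumption: a "combiner" is any character in a mark category
    # (Mn = nonspacing, Mc = spacing combining, Me = enclosing).
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def longest_combining_run(text: str) -> int:
    """Return the longest run of consecutive combining marks after NFD."""
    longest = current = 0
    for ch in unicodedata.normalize("NFD", text):
        if is_combining(ch):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def is_acceptable(text: str, max_combiners: int = 8) -> bool:
    """Reject input with more than max_combiners marks in a row."""
    return longest_combining_run(text) <= max_combiners


def clamp_combiners(text: str, max_combiners: int = 8) -> str:
    """Alternative to rejecting: drop the marks beyond the limit."""
    kept = []
    run = 0
    for ch in unicodedata.normalize("NFD", text):
        if is_combining(ch):
            run += 1
            if run > max_combiners:
                continue  # skip the excess mark
        else:
            run = 0
        kept.append(ch)
    # Recompose so ordinary accented letters come back in NFC form.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    zalgo = "a" + "\u0301" * 20  # one base letter with 20 stacked acute accents
    print(is_acceptable("façade", max_combiners=2))
    print(is_acceptable(zalgo))
    print(longest_combining_run(clamp_combiners(zalgo)))
```

Whether to reject the input outright (`is_acceptable`) or silently trim it (`clamp_combiners`) is a design choice. Rejecting is safer for identifiers such as usernames, while trimming may be friendlier for free-form comments. The default of 8 follows the Tibetan example above and can be lowered to 2 for Western European text.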
