Messed up characters in webpages (especially social media)


Question

Many of you may have seen 'trolls' posting weird characters that mess up the whole webpage on social media sites, forums, or video streaming sites like YouTube.

An example is attached: an image I captured from Instagram, showing a user posting a comment that messes up the entire comment section.

How is such a thing possible? Why is it happening? And how can we prevent things like that from happening on our website?

Recommended answer


How is such a thing possible?

Unicode allows diacritical marks to be used in two ways.

The first is ‘composed’ form, where a single character represents the letter and its diacritical together, for example U+00E9 Latin Small Letter E with Acute (é).

The second is ‘decomposed’ form, where you have a character for the base letter followed by a separate ‘combining diacritical’ character. The text processor and/or font renders the combination of these characters as one grapheme, for example U+0065 Latin Small Letter E followed by U+0301 Combining Acute Accent, which displays as é. The advantage (and arguably the disadvantage) of this is that you can write combinations that have no precomposed character (typically because they were never used in any real language), such as a letter x followed by a combining mark.
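As a minimal sketch of the two forms, the following Python snippet uses the standard `unicodedata` module to convert between them. The variable names are purely illustrative, and the choice of x with a combining acute accent as the "no precomposed character" case is an assumption made for the example, not something taken from the answer.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point (U+00E9)
decomposed = "e\u0301"   # e (U+0065) followed by combining acute (U+0301)

# Both render as the same grapheme but are different code point sequences.
print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# NFC composes where a precomposed character exists; NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# Illustrative assumption: x + combining acute has no precomposed form,
# so NFC leaves it as two code points.
x_acute = "x\u0301"
print(len(unicodedata.normalize("NFC", x_acute)))             # 2
```

Normalizing to NFC before storing or comparing text is a common way to make the two forms compare equal, although it does nothing about long runs of marks that have no composed equivalent.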

Multiple combining diacriticals are allowed on a single letter, because some languages use more than one accent on a letter (and combining characters are also used for other purposes, such as Korean Jamo and Tibetan joined letters). There is no inherent limit to how many combining characters may be used to make a single grapheme.

Many text processors will try to lay out multiple combining diacriticals by piling them up on top of each other (and in the other direction, for ‘below’ accents). In general this is a reasonable way to display a multiply-accented letter for which the font in use has no specific glyph. But it does mean you can go crazy and use absurd numbers of diacriticals to decorate far outside the normal text line.
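To get a feel for how far such text departs from ordinary accented writing, one could measure the longest run of combining marks in a string. The sketch below is only an illustration: `longest_mark_run` is a hypothetical helper name, and the 40-mark sample is made up to imitate the kind of comment described above.

```python
import unicodedata

def longest_mark_run(text: str) -> int:
    """Return the length of the longest run of consecutive combining marks."""
    longest = current = 0
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if unicodedata.category(ch).startswith("M"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

normal = "e\u0301"                 # ordinary decomposed é
abusive = "a" + "\u0301" * 40      # one letter with 40 stacked accents

print(longest_mark_run(normal))    # 1
print(longest_mark_run(abusive))   # 40
```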


How can we prevent things like that from happening on our website?

A simple solution is to put each comment in its own block with CSS overflow: hidden, so that the marks can't escape into other content.

Another possibility is to filter input for sequences of multiple combining characters. For example, with regex you could remove:

\p{M}{9,}

as 8 is the longest sequence of combiners known in a natural language at present. You could try a lower number if you only care about simple alphabets. For this you need a regex engine with support for Unicode character classes (\p), which some languages don't natively have. If your language lacks this but you do have access to the Unicode database (e.g. unicodedata in Python), you could manually walk over the characters looking for those whose general category is M.
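The manual walk described above could look roughly like the following sketch. It relies only on the standard `unicodedata` module, since Python's built-in `re` module does not support `\p{M}`. The function name `strip_long_mark_runs` and the `max_marks` parameter are hypothetical; the default of 8 simply follows the figure quoted in this answer.

```python
import unicodedata

def strip_long_mark_runs(text: str, max_marks: int = 8) -> str:
    """Drop every run of more than max_marks consecutive combining marks.

    With the default, this mirrors removing matches of \\p{M}{9,}:
    runs of up to 8 marks are kept, longer runs are removed entirely.
    """
    out = []
    run = []  # combining marks collected since the last non-mark character
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            run.append(ch)
            continue
        if len(run) <= max_marks:
            out.extend(run)   # short run: keep it
        run = []
        out.append(ch)
    if len(run) <= max_marks:
        out.extend(run)       # handle a run at the very end of the text
    return "".join(out)

print(strip_long_mark_runs("a" + "\u0301" * 9 + "b") == "ab")   # True
print(strip_long_mark_runs("e\u0301") == "e\u0301")             # True
```

Removing the whole run matches what the regex does; an alternative design would be to truncate the run to `max_marks` instead, which preserves more of the user's input at the cost of keeping some of the decoration.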
