How can Z͎̠͗ͣḁ̵͙̑l͖͙̫̲̉̃ͦ̾͊ͬ̀g͔̤̞͓̐̓̒̽o͓̳͇̔ͥ text be prevented?


Question



I've read about how Zalgo text works, and I'm looking to learn how a chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to:

a) either be stripped, assuming chat participants are to use only languages that don't require combining marks (i.e. you could write "fiancé" with a combining mark, but you'd be a bit Zalgo'ed yourself if you insisted on doing so); or,

b) be reduced to a maximum of 8 consecutive characters (the maximum encountered in actual languages)?

EDIT: In the meantime I found a completely differently phrased question ("How to protect against... diacritics?"), which is essentially the same as this one. I made its title more explicit so others will find it as well.
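To make the combining-mark spelling mentioned in (a) concrete, here is a small Python illustration (my own, not part of the original question) of the two ways to encode "fiancé":

```python
import unicodedata

# Two spellings of the same word: precomposed é vs. e + combining acute accent.
precomposed = 'fianc\u00e9'   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining = 'fiance\u0301'    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == combining)                                # False: different code points
print(unicodedata.normalize('NFC', combining) == precomposed)  # True: same text after normalization
print(unicodedata.category('\u0301'))                          # Mn ("Mark, nonspacing")
```

The two strings render identically but compare unequal until normalized, which is why a filter has to reason about combining marks rather than raw code points.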

Solution

Assuming you're very serious about this and want a technical solution, you could do as follows:

  1. Split the incoming text into smaller units (words or sentences);
  2. Render each unit on the server with your font of choice (with a huge line height and lots of space below the baseline where the Zalgo "noise" would go);
  3. Train a machine learning algorithm to judge if it looks too "dark" and "busy";
  4. If the algorithm's confidence is low, defer to human moderators.

This could be fun to implement, but in practice it would likely be better to go to step four straight away.

Edit: Here's a more practical, if blunt, solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the above idea this won't try to determine the "aesthetics" of the text but will instead simply remove all such characters. (Needless to say, this will trash text in many, many languages. Read on for a better solution.) To filter out more character categories add them to ZALGO_CHAR_CATEGORIES.

#!/usr/bin/env python
import unicodedata
import codecs

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        # Decompose to NFD, then drop every character in the banned mark categories.
        print ''.join([c for c in unicodedata.normalize('NFD', line) if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES]),

Example input:

1
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
2
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
3

Output:

1
How does Zalgo text work?
2
How does Zalgo text work?
3
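Stripping every mark is destructive, so it may be worth sketching the question's option (b) instead: keep combining characters but cap how many may appear in a row. This is my own Python 3 sketch (the function name `cap_combining_marks` and the constant are assumptions, not from the original answer):

```python
import unicodedata

MAX_MARKS = 8  # cap suggested in the question; tune for the languages you expect

def cap_combining_marks(text, max_marks=MAX_MARKS):
    """Keep at most max_marks consecutive nonspacing/enclosing marks."""
    out = []
    run = 0
    for ch in unicodedata.normalize('NFD', text):
        if unicodedata.category(ch) in ('Mn', 'Me'):
            run += 1
            if run > max_marks:
                continue  # drop marks beyond the cap
        else:
            run = 0  # a base character resets the run
        out.append(ch)
    return ''.join(out)
```

Ordinary accented text passes through unchanged (in decomposed form), while a Zalgo pile-up on a single base character is trimmed to its first eight marks.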

Finally, if you're looking to detect, rather than unconditionally remove, Zalgo text you could perform character frequency analysis. The program below does that for each line of the input file. The function is_zalgo calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters). It then checks whether the third quartile of the words' scores is greater than THRESHOLD. If THRESHOLD equals 0.5, it means we're trying to detect whether roughly one word in four is made up of more than 50% Zalgo characters. (The THRESHOLD of 0.5 was guessed and may require adjustment for real-world use.) This type of algorithm is probably the best in terms of payoff versus coding effort.

#!/usr/bin/env python
from __future__ import division
import unicodedata
import codecs
import numpy

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
THRESHOLD = 0.5
DEBUG = True

def is_zalgo(s):
    if len(s) == 0:
        return False
    word_scores = []
    for word in s.split():
        cats = [unicodedata.category(c) for c in word]
        # Fraction of the word's code points that are nonspacing/enclosing marks.
        score = sum([cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES]) / len(word)
        word_scores.append(score)
    total_score = numpy.percentile(word_scores, 75)
    if DEBUG:
        print total_score
    return total_score > THRESHOLD

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print is_zalgo(unicodedata.normalize('NFD', line)), "\t", line

Sample output:

0.911483990148
True    Señor, could you or your fiancé explain, H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡

0.333333333333
False   Příliš žluťoučký kůň úpěl ďábelské ódy.  
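The scripts above are Python 2.7. For reference, a Python 3 rendition of the detection function might look as follows; to keep it dependency-free I replaced numpy.percentile with a nearest-rank 75th percentile, which can differ slightly from numpy's interpolated value at quartile boundaries:

```python
import unicodedata

ZALGO_CHAR_CATEGORIES = {'Mn', 'Me'}
THRESHOLD = 0.5

def is_zalgo(s):
    """True if the 75th-percentile per-word mark density exceeds THRESHOLD."""
    words = s.split()
    if not words:
        return False
    scores = []
    for word in words:
        cats = [unicodedata.category(c) for c in word]
        # Fraction of the word's code points that are nonspacing/enclosing marks.
        scores.append(sum(cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES) / len(word))
    scores.sort()
    q3 = scores[(3 * len(scores) - 1) // 4]  # nearest-rank third quartile
    return q3 > THRESHOLD

print(is_zalgo(unicodedata.normalize('NFD', 'Příliš žluťoučký kůň úpěl ďábelské ódy.')))  # False
```

As in the original, the input should be NFD-normalized before scoring so that precomposed accented letters are counted the same way as combining sequences.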
