Profanity Filter using a Regular Expression (list of 100 words)

Question

What is the correct way to strip profane words from a string, given the following?

1) I have a list of 100 words to look for in an array of strings.
2) What is the correct way to handle partial words, and how do most people handle them? For example, the word "mass" is harmless even though it contains a profane substring. Sometimes a partial match is also bad: assuming foobar is an extremely profane word, I may want to disallow foobar, foobar* and *foobar.

So do you put all the words into a single expression, or loop through the list?

What's the right way to tackle it? I'm using Groovy/Grails, but examples in any modern language are welcome.

Recommended Answer

This is quite a difficult problem to solve. You need to determine whether regular expressions will work for you and how you will handle embedding, where a dictionary word is attached to a profanity (like frackface, except with the real F-word).

Regular expressions generally have a limit on how long they can be, and this usually prevents you from using a single regex for all your words. Executing multiple regular expressions against a string can be really slow, depending on what performance you need and how big your blacklist gets. We initially implemented CleanSpeak as a regular expression system, but it didn't scale, so we rewrote it using a different mechanism.
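
For a list on the order of 100 words, the single-expression option raised in the question can be sketched in Python as follows. This is a minimal illustration, not the mechanism CleanSpeak uses: `BLACKLIST`, `build_pattern` and `censor` are hypothetical names, and the two list entries are placeholders. The `match_partial` flag is an assumed way to switch between whole-word matching and the foobar* / *foobar behaviour.

```python
import re

# Hypothetical stand-in blacklist; the entries are placeholders.
BLACKLIST = ["foobar", "hello"]

def build_pattern(words, match_partial=False):
    """Combine all words into one case-insensitive alternation."""
    # Longest first so longer entries win over their own prefixes.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    if match_partial:
        # Also catch foobar* and *foobar by absorbing adjacent word characters.
        return re.compile(r"\w*(?:" + alternation + r")\w*", re.IGNORECASE)
    # Whole words only, so a short entry does not fire inside a longer harmless word.
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

def censor(text, pattern):
    """Replace each match with asterisks of the same length."""
    return pattern.sub(lambda m: "*" * len(m.group()), text)

whole = build_pattern(BLACKLIST)
partial = build_pattern(BLACKLIST, match_partial=True)
print(censor("say foobar now", whole))
print(censor("unfoobarred", partial))
```

Compiling the pattern once and reusing it avoids looping over the list for every string, but it does nothing about spacing, punctuation or leet-speak.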

You also need to consider phrases, punctuation, spaces, leet-speak and other languages. All of these make regular expressions less appealing as a solution. Here are some examples using the word hello (assume it is profanity for this exercise):

  • h e l l o
  • h.e.l.l.o
  • h_e_l_l_o
  • |-|ello
  • h3llo
  • "hello there" (this phrase might not contain any profane word, but the combination is profane)
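
One common way to cope with the spaced, punctuated and leet-speak variants above is to normalize the text before matching. The sketch below assumes a tiny substitution table (`LEET`, `MULTI_CHAR`) and a hypothetical `contains_banned` helper; a real table would be much larger, and this version handles ASCII letters only.

```python
import re

# Assumed leet-speak substitutions; illustrative, not exhaustive.
LEET = {"3": "e", "0": "o", "1": "l", "@": "a", "$": "s"}
MULTI_CHAR = {"|-|": "h"}  # multi-character glyphs are replaced first

def normalize(text):
    """Lowercase, undo leet substitutions, and drop separator characters."""
    text = text.lower()
    for glyph, letter in MULTI_CHAR.items():
        text = text.replace(glyph, letter)
    text = "".join(LEET.get(ch, ch) for ch in text)
    # Remove spaces, punctuation, underscores and anything else that is not a-z.
    return re.sub(r"[^a-z]", "", text)

def contains_banned(text, banned=("hello",)):
    """True if any banned word appears in the normalized text."""
    squashed = normalize(text)
    return any(word in squashed for word in banned)

for sample in ["h e l l o", "h_e_l_l_o", "|-|ello", "h3llo", "good morning"]:
    print(sample, contains_banned(sample))
```

Removing every separator makes the obfuscated spellings easy to catch, but it also erases word boundaries, so harmless neighbouring words can now combine into a match. Phrases that are only profane in combination would need their own list entries.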

You also need to handle edge cases where two or more dictionary (whitelist) words contain a profanity when next to each other. Some examples that contain the s-word:

  • push it
  • this hit

These are obviously not profanity, but most homegrown and many commercial solutions have problems with these cases.
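
To illustrate this edge case with the stand-in word hello: once spaces are ignored, the harmless phrase "shell of" contains it. One possible guard, sketched here under the assumption that a match should begin and end on token boundaries, is to record where each token starts and ends in the squashed string. `find_banned` is a hypothetical helper, not part of any library.

```python
import re

def find_banned(text, banned=("hello",)):
    """Return banned words whose match starts and ends on token boundaries."""
    tokens = re.findall(r"[a-z]+", text.lower())
    squashed = "".join(tokens)
    # Offsets in the squashed string where a token starts or ends.
    starts, ends, pos = set(), set(), 0
    for tok in tokens:
        starts.add(pos)
        pos += len(tok)
        ends.add(pos)
    hits = []
    for word in banned:
        idx = squashed.find(word)
        while idx != -1:
            # Accept only matches aligned with the original token edges.
            if idx in starts and idx + len(word) in ends:
                hits.append(word)
            idx = squashed.find(word, idx + 1)
    return hits

print(find_banned("h e l l o"))         # spaced-out spelling is still caught
print(find_banned("a shell of a man"))  # match straddles two words, so it is rejected
```

The trade-off is that embedded uses inside a single longer token are no longer flagged, so a check like this would need to be combined with a separate rule for embedding.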

We have spent the last 3 years perfecting the filter used by CleanSpeak to ensure it handles all of these cases, and we continue to tweak it and make it better. We also spent 8 months tuning our system for performance, and it can handle about 5,000 messages per second. That is not to say you can't build something usable, but be prepared to handle a lot of issues that might come up, and to create a system that doesn't use regular expressions.
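
The answer does not say which mechanism replaced the regular expressions. As one possible regex-free approach, and purely as an assumption for illustration, a trie built from the blacklist lets a single scan of the text find every entry. `build_trie`, `scan` and the `END` marker are hypothetical names.

```python
END = "<end>"  # multi-character key, so it can never collide with a single text character

def build_trie(words):
    """Build a nested-dict trie; the END key stores the completed word."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def scan(text, trie):
    """Return (start_index, word) for every blacklist entry found in text."""
    text = text.lower()
    hits = []
    for start in range(len(text)):
        node = trie
        i = start
        # Walk the trie for as long as the text keeps matching some entry.
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if END in node:
                hits.append((start, node[END]))
    return hits

trie = build_trie(["foobar", "hello"])
print(scan("say hello, foobar!", trie))
```

With this structure the work per message depends on the message length and the longest entry rather than on how many words are in the list, which is the property a growing blacklist needs.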
