PHP swear word filter


Question

I'm working on a WordPress plugin that replaces the bad words in comments with random new ones from a list.

I now have two arrays: one containing the bad words and another containing the good words.

$bad = array("bad", "words", "here");
$good = array("good", "words", "here");

Since I'm a beginner, I got stuck at some point.

In order to replace the bad words, I've been using $newstring = str_replace($bad, $good, $string);.

My first problem is that I want to turn off case sensitivity, so that I don't have to list every variant such as "bad", "Bad", "BAD", "bAd", "BAd", etc. However, I need the new word to keep the format of the original word: for example, if I write "Bad", it should be replaced with "Words", but if I type "bad", it should be replaced with "words", and so on.

My first thought was to use str_ireplace, but it forgets whether the original word had a capital letter.

The second problem is that I don't know how to deal with users who type like this: "b a d", "w o r d s", etc. I need an idea.

In order to make it select a random word, I think I can use $new = $good[rand(0, count($good)-1)]; and then $newstring = str_replace($bad, $new, $string);. If you have a better idea, I'm here to listen.

The general look of my script:

function noswear($string)
{
    if ($string)
    {
        $bad = array("bad", "words");
        $good = array("good", "words");
        $newstring = str_replace($bad, $good, $string);
        return $newstring;
    }
}

echo noswear("I see bad words coming!");

Thanks in advance for your help!

Recommended answer

Precursor

There are (as has been pointed out in the comments numerous times) gaping holes for you, and/or your code, to fall into by implementing such a feature. To name but a few:

  1. People will add characters to fool the filter
  2. People will become creative (e.g. innuendo)
  3. People will use passive aggression and sarcasm
  4. People will use sentences/phrases, not just words

You'd do better to implement a moderation/flagging system where people can flag offensive comments, which can then be edited or removed by mods, users, etc.

On that understanding, let us proceed...

Given that you:

  1. Have a forbidden word list $bad_words
  2. Have a replacement word list $good_words
  3. Want to replace bad words regardless of case
  4. Want to replace bad words with random good words
  5. Have a correctly escaped bad word list: see http://php.net/preg_quote

You can very easily use PHP's preg_replace_callback function:

$input_string = 'This Could be interesting but should it be? Perhaps this \'would\' work; or couldn\'t it?';

$bad_words  = array('could', 'would', 'should');
$good_words = array('might', 'will');

function replace_words($matches){
    global $good_words;
    return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
}

echo preg_replace_callback('/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i', 'replace_words', $input_string);

Okay, so what the preg_replace_callback call does is match against a single regex pattern built from all of the bad words and pass each match to the callback. Matches will then be in the format:

/(START OR WORD_BOUNDARY OR WHITE_SPACE)(BAD_WORD)(WORD_BOUNDARY OR WHITE_SPACE OR END)/i

The i modifier makes it case insensitive, so both bad and Bad would match.

The function replace_words then takes the matched word and its boundaries (each either empty or a white space character) and replaces it with the same boundaries around a random good word.

global $good_words; <-- Makes the $good_words variable accessible from within the function
$matches[1] <-- The word boundary before the matched word
$matches[3] <-- The word boundary after  the matched word
$good_words[rand(0, count($good_words)-1)] <-- Selects a random good word from $good_words
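
As an aside on the random pick: PHP's built-in array_rand returns a random key of an array, which avoids the manual count()-1 arithmetic. A minimal sketch, where pick_random_word is a hypothetical helper name and not part of the code above:

```php
<?php
// Hypothetical helper: returns one random element of a non-empty list.
function pick_random_word(array $words)
{
    // array_rand() returns a random key; use it to index the array.
    return $words[array_rand($words)];
}

$good_words = array('might', 'will');
echo pick_random_word($good_words), "\n"; // prints either "might" or "will"
```

Inside the callback this would replace the $good_words[rand(0, count($good_words)-1)] expression; the behaviour is intended to be the same.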

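Anonymous function

You can also pass an anonymous function straight to preg_replace_callback (closures with use require PHP 5.3 or later):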

echo preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );

Function wrapper

If you're going to use it multiple times you may also write it as a self-contained function. In that case you will most likely want to feed the good/bad words into the function when calling it (or hard code them in there permanently), but that depends on how you derive them...

function clean_string($input_string, $bad_words, $good_words){
    return preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );
}

echo clean_string($input_string, $bad_words, $good_words);

Output

Running the above functions consecutively with the input and word lists shown in the first example:

This will be interesting but might it be? Perhaps this 'will' work; or couldn't it?
This might be interesting but might it be? Perhaps this 'might' work; or couldn't it?
This might be interesting but will it be? Perhaps this 'will' work; or couldn't it?

Of course the replacement words are chosen randomly, so if I refreshed the page I'd get something else... But this shows what does and doesn't get replaced.
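
Note that the replacements in the sample output are always lower case, whereas the question asked for the new word to keep the capitalisation of the original. One possible extension is sketched below. It is not part of the original answer: match_case and clean_string_keep_case are hypothetical names, and the sketch only distinguishes all caps, a leading capital, and everything else.

```php
<?php
// Hypothetical helper: copy the capitalisation style of $original onto $replacement.
function match_case($original, $replacement)
{
    // All letters upper case, e.g. "BAD" -> "GOOD".
    if ($original === strtoupper($original) && $original !== strtolower($original)) {
        return strtoupper($replacement);
    }
    // Leading capital only, e.g. "Bad" -> "Good".
    if ($original !== '' && $original[0] !== strtolower($original[0])) {
        return ucfirst($replacement);
    }
    // Lower case or irregular mixed case: leave the replacement as it is.
    return $replacement;
}

// Hypothetical variant of clean_string that keeps the original capitalisation.
function clean_string_keep_case($input_string, array $bad_words, array $good_words)
{
    // Escape each bad word for use inside a "/"-delimited pattern.
    $escaped = array();
    foreach ($bad_words as $word) {
        $escaped[] = preg_quote($word, '/');
    }
    $pattern = '/(^|\b|\s)(' . implode('|', $escaped) . ')(\b|\s|$)/i';

    return preg_replace_callback(
        $pattern,
        function ($matches) use ($good_words) {
            $new = $good_words[array_rand($good_words)];
            // $matches[2] is the bad word exactly as the user typed it.
            return $matches[1] . match_case($matches[2], $new) . $matches[3];
        },
        $input_string
    );
}

echo clean_string_keep_case('Could it? It COULD. It could.', array('could'), array('might')), "\n";
// prints: Might it? It MIGHT. It might.
```

With a single replacement word the output is deterministic, which makes the case handling easy to check by eye; with a longer list the word is still random but the capitalisation follows the original.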

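// Escape regex metacharacters in each bad word (point 5 in the list above).
// The second argument makes preg_quote also escape the "/" delimiter used by the pattern.
foreach($bad_words as $key=>$word){
    $bad_words[$key] = preg_quote($word, '/');
}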

Word boundaries \b

In this code I've used \b, \s, and ^ or $ as word boundaries, and there is a good reason for this. While white space, start of string, and end of string are all considered word boundaries, \b will not match in all cases, for example:

\b\$h1t\b <---Will not match

This is because \b only matches at a position between a word character (\w, i.e. [a-zA-Z0-9_]) and a non-word character, and characters like $ don't count as word characters. Between a space and $ both sides are non-word characters, so there is no boundary for \b to match.
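
The difference can be checked directly with preg_match. This is a small sketch with a made-up input string; the last pattern, which uses lookarounds, is an alternative that does not appear in the code above and is shown only for comparison.

```php
<?php
$text = 'what a $h1t show';

// \b needs a word character on one side and a non-word character on the other.
// Between the space and "$" both sides are non-word characters, so this fails.
var_dump(preg_match('/\b\$h1t\b/', $text));                 // int(0)

// Also accepting whitespace or start/end of string as the edge makes it match.
var_dump(preg_match('/(^|\b|\s)\$h1t(\b|\s|$)/', $text));   // int(1)

// Alternative (an assumption, not from the answer): "not next to a word character".
var_dump(preg_match('/(?<!\w)\$h1t(?!\w)/', $text));        // int(1)
```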

Depending on the size of your word list there are a couple of potential hiccups. From a system design perspective it's generally bad form to have huge regexes, for a couple of reasons:

  1. It can be difficult to maintain
     1. It's difficult to read what it does
     2. It's difficult to find errors
  2. It can be memory intensive if the list is too large

Given that the regex pattern is built by PHP from the word list, the first reason is negated. The second should be negated as well; if your word list is large, with a dozen permutations of each bad word, then I suggest you stop and rethink your approach (read: use a flagging/moderation system).

To clarify, I don't see a problem with having a small word list to filter out specific expletives, as it serves a purpose: to stop users from having an outburst at one another. The problem comes when you try to filter out too much, including permutations. Stick to filtering common swear words, and if that doesn't work then, for the last time, implement a flagging/moderation system.
