PHP swear word filter
Question
I'm working on a WordPress plugin that replaces the bad words from the comments with random new ones from a list.
I now have 2 arrays: one containing the bad words and another containing the good words.
$bad = array("bad", "words", "here");
$good = array("good", "words", "here");
Since I'm a beginner, I got stuck at some point.
In order to replace the bad words, I've been using $newstring = str_replace($bad, $good, $string);
My first problem is that I want to turn off case sensitivity, so that I don't have to list every variant like "bad", "Bad", "BAD", "bAd", "BAd", etc., but I need the new word to keep the format of the original word. For example, if I write "Bad", it would be replaced with "Words", but if I type "bad", it would be replaced with "words", etc.
My first thought was to use str_ireplace, but it forgets whether the original word had a capital letter.
The second problem is that I don't know how to deal with the users that type like this: "b a d", "w o r d s", etc. I need an idea.
In order to make it select a random word, I think I can use $new = $good[rand(0, count($good)-1)]; then $newstring = str_replace($bad, $new, $string); If you have a better idea, I'm here to listen.
The general look of my script:
function noswear($string)
{
if ($string)
{
$bad = array("bad", "words");
$good = array("good", "words");
$newstring = str_replace($bad, $good, $string);
return $newstring;
}
}
echo noswear("I see bad words coming!");
Thanks in advance for your help!
Recommended answer
Precursor
There are (as has been pointed out in the comments numerous times) gaping holes for you, and/or your code, to fall into by implementing such a feature, to name but a few:
- People will add characters to fool the filter
- People will become creative (e.g. innuendo)
- People will use passive aggression and sarcasm
- People will use sentences/phrases not just words
You'd do better to implement a moderation/flagging system where people can flag offensive comments which can then be edited/removed by mods, users, etc.
On that understanding, let us proceed...
Given that you:
- Have a forbidden word list $bad_words
- Have a replacement word list $good_words
- Want to replace bad words regardless of case
- Want to replace bad words with random good words
- Have a correctly escaped bad word list: see http://php.net/preg_quote
You can very easily use PHP's preg_replace_callback function:
$input_string = 'This Could be interesting but should it be? Perhaps this \'would\' work; or couldn\'t it?';
$bad_words = array('could', 'would', 'should');
$good_words = array('might', 'will');
function replace_words($matches){
global $good_words;
return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
}
echo preg_replace_callback('/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i', 'replace_words', $input_string);
Okay, so what the preg_replace_callback call does is compile a regex pattern consisting of all of the bad words. Matches will then be in the format:
/(START OR WORD_BOUNDARY OR WHITE_SPACE)(BAD_WORD)(WORD_BOUNDARY OR WHITE_SPACE OR END)/i
The i modifier makes it case insensitive so both bad and Bad would match.
The function replace_words then takes the matched word and its boundaries (either empty or a white space character) and replaces it with the boundaries and a random good word.
global $good_words; <-- Makes the $good_words variable accessible from within the function
$matches[1] <-- The word boundary before the matched word
$matches[3] <-- The word boundary after the matched word
$good_words[rand(0, count($good_words)-1)] <-- Selects a random good word from $good_words
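As an alternative to computing the index by hand, PHP's built-in array_rand returns a random key of an array. The following is a minimal sketch rather than part of the original answer; the helper name random_good_word is hypothetical, and the word list is the one from the example above.

```php
<?php
// Hypothetical helper (not from the original answer): pick one replacement word.
function random_good_word(array $good_words): string
{
    // array_rand() returns a random key of the array,
    // so there is no need to compute count() - 1 by hand.
    return $good_words[array_rand($good_words)];
}

$good_words = array('might', 'will');
echo random_good_word($good_words), PHP_EOL; // prints one of the two words
```

Inside the callback this would replace the $good_words[rand(...)] expression; the behaviour is otherwise the same.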
Anonymous function
You can also pass an anonymous function to preg_replace_callback:
echo preg_replace_callback(
'/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
function ($matches) use ($good_words){
return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
},
$input_string
);
Function wrapper
If you're going to use it multiple times you may also write it as a self-contained function, although in this case you will most likely want to feed the good/bad words into the function when calling it (or hard-code them in there permanently); that depends on how you derive them...
function clean_string($input_string, $bad_words, $good_words){
return preg_replace_callback(
'/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
function ($matches) use ($good_words){
return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
},
$input_string
);
}
echo clean_string($input_string, $bad_words, $good_words);
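The wrapper above picks a random replacement but does not carry over the capitalisation of the matched word, which the question asked for. The following is a hedged sketch of one way to do that, not the original answer's method: the helpers match_case and clean_string_keep_case are hypothetical names, and the case handling assumes plain ASCII words (multibyte text would need the mb_* functions instead).

```php
<?php
// Hypothetical helper: copy the capitalisation style of $original onto $replacement.
function match_case(string $original, string $replacement): string
{
    if ($original === strtolower($original)) {
        return strtolower($replacement);          // "bad" -> "good"
    }
    if ($original === strtoupper($original)) {
        return strtoupper($replacement);          // "BAD" -> "GOOD"
    }
    if ($original === ucfirst(strtolower($original))) {
        return ucfirst(strtolower($replacement)); // "Bad" -> "Good"
    }
    return $replacement; // mixed case such as "bAd": leave the replacement as listed
}

// Hypothetical variant of clean_string() that keeps the original word's case.
function clean_string_keep_case(string $input_string, array $bad_words, array $good_words): string
{
    // Escape each bad word so regex metacharacters are treated literally.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $bad_words);

    return preg_replace_callback(
        '/(^|\b|\s)(' . implode('|', $escaped) . ')(\b|\s|$)/i',
        function ($matches) use ($good_words) {
            $good = $good_words[array_rand($good_words)];
            // $matches[2] is the bad word exactly as the user typed it.
            return $matches[1] . match_case($matches[2], $good) . $matches[3];
        },
        $input_string
    );
}

echo clean_string_keep_case('Bad things, BAD things, bad things', array('bad'), array('good')), PHP_EOL;
// prints: Good things, GOOD things, good things
```

Only three styles are recognised (all lower, all upper, first letter capitalised); anything else falls back to the replacement word as it appears in the list.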
Output
Running the above functions consecutively with the input and word lists shown in the first example:
This will be interesting but might it be? Perhaps this 'will' work; or couldn't it?
This might be interesting but might it be? Perhaps this 'might' work; or couldn't it?
This might be interesting but will it be? Perhaps this 'will' work; or couldn't it?
Of course the replacement words are chosen randomly so if I refreshed the page I'd get something else... But this shows what does/doesn't get replaced.
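// Escape regex metacharacters in each bad word before building the pattern.
// The second argument also escapes the "/" delimiter used by the patterns above.
foreach($bad_words as $key=>$word){
$bad_words[$key] = preg_quote($word, '/');
}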
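The same escaping step can be written without an explicit loop. This is a small sketch, assuming a word list that contains a regex metacharacter (the list itself is made up for illustration); it uses array_map to apply preg_quote to every entry.

```php
<?php
// Illustrative list: the first word contains "$", which is a regex metacharacter.
$bad_words = array('$h1t', 'could');

// array_map applies preg_quote to every word; the second argument also
// escapes the "/" delimiter used by the patterns in this answer.
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $bad_words);

print_r($escaped); // the first entry becomes \$h1t, the second is unchanged
```

Without this step the "$" would be read as an end-of-string anchor and the word would never match.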
Word boundaries \b
In this code I've used \b, \s, and ^ or $ as word boundaries, and there is a good reason for this. While white space, start of string, and end of string are all considered word boundaries, \b will not match in all cases, for example:
\b\$h1t\b <---Will not match
This is because \b only matches at a position between a word character (i.e. [a-zA-Z0-9_]) and a non-word character, and characters like $ don't count as word characters. With a space on one side and $ on the other there is no word boundary, so the leading \b fails.
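The difference can be checked directly with preg_match. The snippet below is a minimal sketch; the input string is made up for illustration, and the second pattern is the one used throughout this answer.

```php
<?php
$input = 'what a $h1t day';
$word  = preg_quote('$h1t', '/'); // becomes \$h1t

// \b on both sides: there is no word boundary between the space and "$",
// because both are non-word characters, so this pattern finds nothing.
$strict = preg_match('/\b' . $word . '\b/i', $input);

// Allowing start of string or white space before the word lets it match.
$loose = preg_match('/(^|\b|\s)(' . $word . ')(\b|\s|$)/i', $input);

var_dump($strict); // int(0)
var_dump($loose);  // int(1)
```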
Depending on the size of your word list there are a couple of potential hiccups. From a system design perspective it's generally bad form to have huge regexes for a couple of reasons:
- It can be difficult to maintain
- It's difficult to read what it does
- It's difficult to find errors
Given that the regex pattern is compiled by PHP, the first reason is negated. The second should be negated as well; if your word list is large, with a dozen permutations of each bad word, then I suggest you stop and rethink your approach (read: use a flagging/moderation system).
To clarify, I don't see a problem with having a small word list to filter out specific expletives, as it serves a purpose: to stop users from having an outburst at one another. The problem comes when you try to filter out too much, including permutations. Stick to filtering common swear words and if that doesn't work then - for the last time - implement a flagging/moderation system.