Filtering words from text with exploits


Question

I have a filter that catches bad words such as 'ass' and 'fuck'. Now I am trying to handle exploits such as "f*ck" and "sh/t", where a letter is swapped for a symbol.

One option is to match each word against a dictionary of bad words that also lists such exploited spellings. But that is pretty static and not a good approach.

Another option is to use the Levenshtein distance: block any word whose Levenshtein distance to a bad word is 1. But this approach is also prone to false positives.

if (!ctype_alpha($text) && levenshtein('shit', $text) === 1)
{
  // match
}
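To make the trade-off concrete, here is a minimal sketch of the same check applied against a list of bad words. The helper name `looksObfuscated` and the word list are assumptions made for illustration; only `ctype_alpha` and `levenshtein` come from the snippet above.

```php
<?php
// Hypothetical helper: flags a token that contains at least one non-letter
// and is exactly one edit away from a known bad word.
function looksObfuscated(string $token, array $badwords): bool
{
    // Purely alphabetic tokens are left to the plain word filter.
    if (ctype_alpha($token)) {
        return false;
    }
    foreach ($badwords as $badword) {
        if (levenshtein($badword, strtolower($token)) === 1) {
            return true;
        }
    }
    return false;
}
```

With the list `['ass', 'shit', 'fuck']`, this flags 'sh/t' and 'f*ck' but misses 'f**k' (distance 2). It also flags the harmless token 'as,' because that is one substitution away from 'ass', which is the kind of false positive mentioned above.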

I am looking for a way to do this with regex. Maybe I could combine Levenshtein distance with a regex, but I could not figure out how.

Any suggestions are highly appreciated.

Recommended answer

As stated in the comments, it is hard to get this right. This snippet, far from perfect, checks for matches where letters are replaced by the same number of other characters.

It may give you a general idea of how to solve this, although much more logic is needed if you want to make it smarter. This filter, for instance, will not filter 'fukk', 'f ck', 'f**ck', 'fck', '.fuck' (with a leading dot) or 'fück', while it probably does filter '++++' and replace it with 'beep'. But it also filters 'f*ck', 'f**k', 'f*cking' and 'sh1t', so it could do worse. :)

An easy way to improve it is to split the string more cleverly, so that punctuation marks aren't glued to the adjacent word. Another improvement could be to remove all non-alphabetic characters from each word and check whether the remaining letters appear in the same order in a bad word. That way, 'f\/ck' would also match 'fuck'. Anyway, let your imagination run wild, but be careful about false positives. And trust me, 'they' will always find a way to express themselves that bypasses your filter.
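Those two improvements could look roughly like the sketch below. The function names (`tokenize`, `lettersInOrder`, `isBadToken`), the punctuation list, and the extra guards (same first letter, at least two letters left, token at least as long as the bad word) are all assumptions added to keep false positives down; they are not part of the snippet that follows.

```php
<?php
// Split on whitespace, then trim common punctuation from both ends,
// so 'f*ck,' becomes 'f*ck' while inner symbols are kept.
function tokenize(string $text): array
{
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(function ($t) {
        return trim($t, '.,;:!?"()');
    }, $tokens);
}

// True if every character of $letters occurs in $word, in the same order.
function lettersInOrder(string $letters, string $word): bool
{
    $pos = 0;
    for ($i = 0; $i < strlen($letters); $i++) {
        $pos = strpos($word, $letters[$i], $pos);
        if ($pos === false) {
            return false;
        }
        $pos++;
    }
    return true;
}

// Strip non-letters from the token and compare what is left to each bad word.
function isBadToken(string $token, array $badwords): bool
{
    $token = strtolower($token);
    $letters = preg_replace('/[^a-z]/', '', $token);
    foreach ($badwords as $badword) {
        if ($letters === $badword) {
            return true; // exact match once symbols are stripped
        }
        // Assumed guards: token is long enough, keeps at least two letters,
        // starts with the same letter, and its letters are in order.
        if (strlen($token) >= strlen($badword)
            && strlen($letters) >= 2
            && $letters[0] === $badword[0]
            && lettersInOrder($letters, $badword)) {
            return true;
        }
    }
    return false;
}
```

With `['shit', 'fuck']`, this treats 'f*ck', 'sh/t', 'f\/ck' and '.fuck' as bad and leaves 'shot' and 'fukk' alone. Unlike the prefix comparison in the snippet below, it does not catch derivatives such as 'f*cking'.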

<?php 
$badwords = array('shit', 'fuck');
$text = 'Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)';
$words = explode(' ', $text);

// Loop through all words.
foreach ($words as $word)
{
  $naughty = false;
  // Match each bad word against each word.
  foreach ($badwords as $badword)
  {
    // If the word is shorter than the bad word, it's okay. 
    // It may be bigger. I've done this mainly, because in the example given, 
    // 'f*ck,' will contain the trailing comma. This could be easily solved by
    // splitting the string a bit smarter. But the added benefit, is that it also
    // matches derivatives, like 'f*cking' or 'f*cker', although that could also 
    // result in more false positives.
    if (strlen($word) >= strlen($badword))
    {
      $wordOk = false;
      // Check each character in the string.
      for ($i = 0; $i < strlen($badword); $i++)
      {
        // If the letters don't match, and the letter is an actual 
        // letter, this is not a bad word.
        if ($badword[$i] !== $word[$i] && ctype_alpha($word[$i]))
        {
          $wordOk = true;
          break;
        }
      }
      // If the word is not okay, break the loop.
      if (!$wordOk)
      {
        $naughty = true;
        break;
      }
    }
  }

  // Echo the censored word.
  echo $naughty ? 'beep ' : ($word . ' ');
}
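Since the question asked for a regex, roughly the same per-character comparison can be sketched as a generated pattern. This is a variation, not the code above: it assumes the first letter must be literal (so '++++' is left alone), lets every later letter be replaced by one non-letter, non-space character, and swallows a trailing run of letters so derivatives are replaced whole. The names `buildPattern` and `censorWithRegex` are hypothetical.

```php
<?php
// Hypothetical helper: builds a case-insensitive pattern for one bad word.
function buildPattern(string $badword): string
{
    $wild = '[^a-z\s]'; // any single non-letter, non-whitespace character
    $parts = [];
    foreach (str_split($badword) as $i => $letter) {
        $literal = preg_quote($letter, '/');
        // Assumption: the first letter must appear as-is.
        $parts[] = $i === 0 ? $literal : "(?:$literal|$wild)";
    }
    // (?<![a-z]) keeps the match from starting in the middle of a word;
    // [a-z]* swallows suffixes such as 'ing' or 'er'.
    return '/(?<![a-z])' . implode('', $parts) . '[a-z]*/i';
}

// Replace every match of every bad word with 'beep'.
function censorWithRegex(string $text, array $badwords): string
{
    foreach ($badwords as $badword) {
        $text = preg_replace(buildPattern($badword), 'beep', $text);
    }
    return $text;
}

echo censorWithRegex(
    'Man, I shot this f*ck, sh/t! fucking fucker sh!t fukk. I love this. ;)',
    ['shit', 'fuck']
);
// prints: Man, I shot this beep, beep! beep beep beep fukk. I love this. ;)
```

It is still easy to fool and still produces false positives: "it's..." matches the 'shit' pattern, because the 's' follows an apostrophe and is followed by three non-letters.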
