Finding Banned Words On A Page And Not Within Other Words


Question

I am trying to add a banned-words filter to a web proxy. I want to match banned words only as whole words anywhere in the loaded page (meta tags, content), NOT as substrings inside other words.

So if I am looking for the word "cock", the word "cockerel" should not trigger the filter.

I just tested this code and, as expected, it works, but as you can guess it burns a lot of CPU. One moment the page loads, the next it goes grey and shows signs that the page is taking too long to load. And all this is on localhost. I can imagine what my web host would do! So we need a better solution. Any ideas? What if the script did not check the loaded page for all the banned words? What if it halted as soon as one banned word was found, after echoing which banned word was found and where on the page (meta tags, body content, etc.)? Any code suggestions?

Here is what I have so far:

<?php

/*
ERROR HANDLING
*/

// 1). $curl is going to be data type curl resource.
$curl = curl_init();

// 2). Set cURL options.
curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// 3). Run cURL (execute http request).
$result = curl_exec($curl);
$response = curl_getinfo($curl);

if ($response['http_code'] == 200)
{
    // Set banned words.
    $banned_words = array("Prick","Dick","***");

    // Separate each word found on the cURL-fetched page.
    $word = explode(" ", $result);

    //var_dump($word);

    // Use < rather than <= so the loop does not read past the last element.
    for ($i = 0; $i < count($word); $i++)
    {
        foreach ($banned_words as $ban)
        {
            if (strtolower($word[$i]) == strtolower($ban))
            {
                echo "word: $word[$i]<br />";
                echo "Match: $ban<br>";
            }
            else
            {
                echo "word: $word[$i]<br />";
                echo "No Match: $ban<br>";
            }
        }
    }
}

// 4). Close cURL resource.
curl_close($curl);

I was told to do it like this:

Load the page into a string.
Use preg_match with word boundaries on the loaded string and loop through your banned words.

Q1: How do I load the page into a string? I have no clue how to start on this, so any sample code would be appreciated by all newbies, including me.

UPDATE: I updated my code to include miknik's code. It was working fine until I added this line before the cURL call: $banned_words = array("Prick","Dick","***");

Here is the updated code:

<?php

/*
ERROR HANDLING
*/

// 1). Set banned words.
$banned_words = array("Prick","Dick","***");

// 2). $curl is going to be data type curl resource.
$curl = curl_init();

// 3). Set cURL options.
curl_setopt($curl, CURLOPT_URL, 'https://www.buzzfeed.com/mjs538/the-68-words-you-cant-say-on-tv?utm_term=.xlN0R1Go89#.pbdl8dYm3X');
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

// 4). Run cURL (execute http request).
$result = curl_exec($curl);
$response = curl_getinfo($curl);

if ($response['http_code'] == '200')
{
    $regex = '/\b';      // The beginning of the regex string syntax
    $regex .= implode('\b|\b', $banned_words);      // joins all the banned words to the string with correct regex syntax
    $regex .= '\b/i';    // Adds ending to regex syntax. Final i makes it case insensitive
    $substitute = '****';
    $cleanresult = preg_replace($regex, $substitute, $result);
    echo $cleanresult;
}

curl_close($curl);

?>


Recommended Answer

You already have the page content as a string; it is in $result.

preg_match will work, but what do you want to do once you find a match? preg_replace is more appropriate if you want to filter out the banned words.
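If the goal is to halt at the first banned word and report which one it was and where, preg_match already stops at the first match it finds. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name find_first_banned and the sample text are hypothetical, and the banned words are assumed to contain only letters so they can be dropped into the pattern unescaped.

```php
<?php
// Hypothetical helper (not from the original post): returns the first banned
// word found in $text together with its byte offset, or null if none is found.
// Assumes the banned words contain no regex metacharacters.
function find_first_banned(string $text, array $banned_words): ?array
{
    // Same pattern shape as in the answer: \bword1\b|\bword2\b, case-insensitive.
    $regex = '/\b' . implode('\b|\b', $banned_words) . '\b/i';

    // preg_match stops at the first match; PREG_OFFSET_CAPTURE makes each
    // match entry an array of [matched text, byte offset].
    if (preg_match($regex, $text, $m, PREG_OFFSET_CAPTURE) === 1) {
        return ['word' => $m[0][0], 'offset' => $m[0][1]];
    }
    return null;
}

$hit = find_first_banned('A cockerel crowed, then someone said Dick.', ['prick', 'dick']);
if ($hit !== null) {
    echo "Banned word '{$hit['word']}' found at byte offset {$hit['offset']}\n";
}
```

The offset is a position in the raw string, so it says where in the HTML source the word occurs; mapping that to "meta tag" versus "body" would need extra parsing that this sketch does not attempt.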

There is no need to explode the string into individual words; doing so just adds a lot of CPU overhead. Process the $result variable as is.

So first, construct a regex string from your array of banned words. A basic syntax for matching each word is \bXXXX\b, where XXXX is your banned word. The \b at each end means the word must sit at a word boundary, so \bcock\b would match cock and cock! but not cockerel.
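The word-boundary behaviour can be checked in isolation. This small sketch runs the \bcock\b pattern against a few sample strings; the sample list (including "peacock") is an illustration added here, not part of the original answer.

```php
<?php
// Case-insensitive whole-word pattern for a single banned word.
$pattern = '/\bcock\b/i';

// Sample inputs chosen for illustration only.
foreach (['cock', 'cock!', 'cockerel', 'peacock'] as $sample) {
    $matched = preg_match($pattern, $sample) === 1;
    echo $sample, ' => ', $matched ? 'match' : 'no match', "\n";
}
```

Only the first two samples match: in "cockerel" and "peacock" the banned word is adjacent to another letter, so there is no word boundary on that side.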

$regex = '/\b';      // The beginning of the regex string syntax
$regex .= implode('\b|\b', $banned_words);      // joins all the banned words to the string with correct regex syntax
$regex .= '\b/i';    // Adds ending to regex syntax. Final i makes it case insensitive
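One caveat with joining the words directly: any banned word containing regex metacharacters (for example * or .) is interpreted as pattern syntax rather than literal text. A hedged variant, assuming you want every entry treated literally, escapes each word with preg_quote first. The function name build_banned_regex and the sample entry 'a.b' are hypothetical.

```php
<?php
// Hypothetical helper (not from the original answer): builds a whole-word,
// case-insensitive pattern in which every banned word is matched literally.
function build_banned_regex(array $banned_words): string
{
    // preg_quote escapes regex metacharacters; the second argument also
    // escapes the "/" delimiter used below.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $banned_words);

    // A non-capturing group lets one pair of \b wrap all alternatives.
    return '/\b(?:' . implode('|', $escaped) . ')\b/i';
}

$regex = build_banned_regex(['Prick', 'Dick', 'a.b']);
echo preg_replace($regex, '****', 'A prick wrote a.b but not axb.'), "\n";
```

Note that \b only marks a transition between a word character and a non-word character, so an entry made up entirely of symbols would still need a different boundary rule than the one sketched here.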

Now you can run a single operation on $result and get a new string with all the banned words censored. Set the value to be substituted for each banned word:

$substitute = '****';

Then perform the replacement:

$cleanresult = preg_replace($regex, $substitute, $result);

Assuming $result = 'You are a cock! You prick! You are such a dick.';

echo $cleanresult returns: You are a ****! You ****! You are such a ****.
