Filtering JavaScript from site content in PHP


Question

I'm writing a script that checks the keyword density of a page, based on a URL the user submits. I have been using strip_tags, but it does not completely filter out the JavaScript and other code from the actual word content of the page. Is there a better way to separate the code on a page from its actual word content?

if (isset($_POST['url'])) {
    $url        = $_POST['url'];
    $str        = strip_tags(file_get_contents($url));
    $words      = str_word_count(strtolower($str), 1);
    $word_count = array_count_values($words);

    foreach ($word_count as $key => $val) {
        $density = ($val / count($words)) * 100;
        echo "$key - COUNT: $val, DENSITY: " . number_format($density, 2) . "%<br/>\n";
    }
}

Solution

I have written 2 functions for this:

/**
 * Removes the given elements, including their content, from an HTML string
 *
 * @param string   $str    The HTML string
 * @param string[] $tagArr Names of the tags to be removed
 *
 * @return string The HTML string without those elements
 */
function removeTags($str, $tagArr)
{
    foreach ($tagArr as $tag) {
        // Quote the tag name so it cannot alter the pattern;
        // \b keeps a name like "s" from also matching "<script>"
        $tag = preg_quote($tag, '#');
        $str = preg_replace('#<' . $tag . '\b[^>]*>.*?</' . $tag . '\s*>#is', '', $str);
    }
    return $str;
}

/**
 * Cleans an HTML string down to its text content
 *
 * @param string $str An HTML string
 *
 * @return string The cleaned string
 */
function filterHtml($str)
{
    // Remove script and style elements together with their content
    $str = removeTags($str, ['script', 'style']);

    // Remove all remaining tags, but keep their content
    $str = preg_replace('/<[^>]*>/', ' ', $str);

    // Replace line breaks and tabs with spaces
    $str = str_replace(["\n", "\t", "\r"], ' ', $str);

    // Collapse runs of spaces into a single space
    while (strpos($str, '  ') !== false) {
        $str = str_replace('  ', ' ', $str);
    }

    // Return the trimmed result
    return trim($str);
}
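
To see why the script and style elements have to be removed before the generic tag stripping, here is a small offline check. It is only a sketch: the sample markup and the variable names are made up for illustration, and the single combined pattern is a compact stand-in for the two functions above, not a replacement for them.

```php
<?php
// Hypothetical sample page; not taken from the original post.
$html = '<html><head><style>p { color: red; }</style>'
      . '<script>var tracker = "analytics";</script></head>'
      . '<body><p>Hello world</p></body></html>';

// strip_tags() removes only the tags themselves, so the bodies of the
// script and style elements are left behind as plain text.
$naive = strip_tags($html);

// Dropping script/style elements (tags plus content) first avoids that.
$withoutCode = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html);

// Then replace the remaining tags with spaces and collapse whitespace.
$clean = trim(preg_replace('/\s+/', ' ', preg_replace('/<[^>]*>/', ' ', $withoutCode)));

var_dump(strpos($naive, 'tracker') !== false); // is the script text still in the naive result?
var_dump($clean);
```

Replacing tags with a space rather than an empty string keeps words from adjacent elements from running together, which matters for a word count.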

Working Example

$fileContent     = file_get_contents('http://stackoverflow.com/questions/25537377/filtering-html-from-site-content-php');
$filteredContent = filterHtml($fileContent);
var_dump($filteredContent);
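
The filtered string can then go through the density calculation from the question. The following is a minimal sketch under some assumptions: `keywordDensity` is a hypothetical helper name, and the sample sentence stands in for the output of `filterHtml()` so the snippet runs without a network request.

```php
<?php
/**
 * Hypothetical helper: computes the count and density (in percent)
 * of each word in an already-filtered plain-text string.
 *
 * @param string $text Plain text, e.g. the result of filterHtml()
 *
 * @return array Map of word => ['count' => int, 'density' => float]
 */
function keywordDensity(string $text): array
{
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);

    // No words means nothing to report, and avoids dividing by zero
    if ($total === 0) {
        return [];
    }

    $result = [];
    foreach (array_count_values($words) as $word => $count) {
        $result[$word] = [
            'count'   => $count,
            'density' => $count / $total * 100,
        ];
    }
    return $result;
}

// Made-up sample text standing in for filterHtml() output
$stats = keywordDensity('PHP filters text and PHP counts words');

foreach ($stats as $word => $info) {
    echo $word, ' - COUNT: ', $info['count'],
         ', DENSITY: ', number_format($info['density'], 2), "%\n";
}
```

Counting the words once, outside the loop, also avoids calling `count()` on every iteration as the loop in the question does.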
