How to detect fake users (crawlers) and cURL

Views: 207 | Published: 2017/3/5 21:28:31 | Tags: php, curl, spam-prevention


Question

Some other websites use cURL and a fake HTTP referer to copy my website's content. Is there any way to detect cURL, or a client that is not a real web browser?

Solution

There is no magic solution to prevent automatic crawling. Everything a human can do, a robot can do too. The only solutions make the job harder, hard enough that only highly skilled geeks will try to get past them.

I had the same trouble some years ago, and my first piece of advice is: if you have time, be a crawler yourself (I assume a "crawler" is the person who crawls your website). It is the best school for the subject. By crawling several websites I learned about different kinds of protection, and by combining them I got good results.

Here are some examples of protections you can try.

Sessions per IP

If a user opens 50 new sessions each minute, you can assume this user may be a crawler that does not handle cookies. Of course, curl manages cookies perfectly well, but if you couple this check with a per-session visit counter (explained later), or if your crawler is a novice when it comes to cookies, it can be effective.

It is hard to imagine 50 people on the same shared connection visiting your website simultaneously (this of course depends on your traffic; that is up to you). If it does happen, you can lock the pages of your website until a captcha is filled in.

Idea:

1) Create 2 tables: one to store banned IPs and one to store IPs and their sessions.

create table if not exists sessions_per_ip (
  ip int unsigned,
  session_id varchar(32),
  creation timestamp default current_timestamp,
  primary key(ip, session_id)
);

create table if not exists banned_ips (
  ip int unsigned,
  creation timestamp default current_timestamp,
  primary key(ip)
);

2) At the beginning of your script, delete entries that are too old from both tables.

3) Next, check whether the user's IP is banned (if it is, set a flag to true).

4) If not, count how many sessions exist for this IP.

5) If there are too many sessions, insert the IP into the banned table and set the flag.

6) Insert the IP and session into the sessions-per-IP table if they have not already been inserted.

I wrote a code sample to illustrate the idea.

<?php

try
{

    // Some configuration (small values for demo)
    $max_sessions = 5; // 5 sessions/ip simultaneously allowed
    $check_duration = 30; // 30 secs max lifetime of an ip on the sessions_per_ip table
    $lock_duration = 60; // time to lock your website for this ip if max_sessions is reached

    // Mysql connection
    require_once("config.php");
    $dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Delete old entries in tables
    $query = "delete from sessions_per_ip where timestampdiff(second, creation, now()) > {$check_duration}";
    $dbh->exec($query);

    $query = "delete from banned_ips where timestampdiff(second, creation, now()) > {$lock_duration}";
    $dbh->exec($query);

    // Get useful info attached to our user...
    session_start();
    $ip = ip2long($_SERVER['REMOTE_ADDR']);
    $session_id = session_id();

    // Check if IP is already banned
    $banned = false;
    $count = $dbh->query("select count(*) from banned_ips where ip = '{$ip}'")->fetchColumn();
    if ($count > 0)
    {
        $banned = true;
    }
    else
    {
        // Count entries in our db for this ip
        $query = "select count(*)  from sessions_per_ip where ip = '{$ip}'";
        $count = $dbh->query($query)->fetchColumn();
        if ($count >= $max_sessions)
        {
            // Lock website for this ip
            $query = "insert ignore into banned_ips ( ip ) values ( '{$ip}' )";
            $dbh->exec($query);
            $banned = true;
        }

        // Insert a new entry on our db if user's session is not already recorded
        $query = "insert ignore into sessions_per_ip ( ip, session_id ) values ('{$ip}', '{$session_id}')";
        $dbh->exec($query);
    }

    // At this point, $banned tells you whether the user is banned.
    // The following code lets us test it...

    // Buffer the output: the demo echoes content before it regenerates the
    // session and sends a Location header, and headers must be sent before
    // any output.
    ob_start();

    // Displays your current sessions
    echo "Your current session keys are:<br/>";
    $query = "select session_id from sessions_per_ip where ip = '{$ip}'";
    foreach ($dbh->query($query) as $row) {
        echo "{$row['session_id']}<br/>";
    }

    // Display and handle a way to create new sessions
    echo str_repeat('<br/>', 2);
    echo '<a href="' . basename(__FILE__) . '?new=1">Create a new session / reload</a>';
    if (isset($_GET['new']))
    {
        session_regenerate_id();
        session_destroy();
        header("Location: " . basename(__FILE__));
        die();
    }

    // Display if you're banned or not
    echo str_repeat('<br/>', 2);
    if ($banned)
    {
        echo '<span style="color:red;">You are banned: wait 60 secs to be unbanned... a captcha would be friendlier, of course!</span>';
        echo '<br/>';
        echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
    }
    else
    {
        echo '<span style="color:blue;">You are not banned!</span>';
        echo '<br/>';
        echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
    }
    ob_end_flush();
}
catch (PDOException $e)
{
    error_log($e->getMessage()); // log the error instead of silently discarding it
}

?>
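
The bookkeeping in steps 2 to 6 can also be sketched as a pure function, which makes the thresholds easy to test before wiring them to MySQL. This is only an illustration under assumptions: the function name `registerSession`, the layout of the `$state` array, and the use of plain string IPs as keys are hypothetical and not part of the original demo. String keys are used here because `ip2long()` only understands IPv4 addresses and returns `false` for anything else, such as an IPv6 client.

```php
<?php

/**
 * Hypothetical in-memory version of the sessions-per-IP check.
 *
 * $state holds two keys:
 *   'sessions' => [ip => [session_id => first_seen_timestamp]]
 *   'banned'   => [ip => banned_at_timestamp]
 *
 * Returns true when the IP is banned after this request.
 */
function registerSession(
    array &$state,
    string $ip,
    string $sessionId,
    int $now,
    int $maxSessions = 5,
    int $checkDuration = 30,
    int $lockDuration = 60
): bool {
    $state += ['sessions' => [], 'banned' => []];

    // Step 2: drop entries that are too old (same rule as the DELETE queries).
    foreach ($state['banned'] as $bannedIp => $since) {
        if ($now - $since > $lockDuration) {
            unset($state['banned'][$bannedIp]);
        }
    }
    foreach ($state['sessions'] as $seenIp => $sessions) {
        foreach ($sessions as $id => $since) {
            if ($now - $since > $checkDuration) {
                unset($state['sessions'][$seenIp][$id]);
            }
        }
    }

    // Step 3: an IP that is already banned stays banned.
    if (isset($state['banned'][$ip])) {
        return true;
    }

    // Steps 4 and 5: count live sessions for this IP and ban if needed.
    $banned = false;
    if (count($state['sessions'][$ip] ?? []) >= $maxSessions) {
        $state['banned'][$ip] = $now;
        $banned = true;
    }

    // Step 6: record this session once (the equivalent of INSERT IGNORE).
    if (!isset($state['sessions'][$ip][$sessionId])) {
        $state['sessions'][$ip][$sessionId] = $now;
    }

    return $banned;
}
```

Passing the current time as `$now` instead of calling `time()` inside the function is a design choice for testability; in a real page you would call it with `time()` and persist the state in the two tables described above.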

Visit Counter

If your user uses the same cookie to crawl your pages, you can use their session to block them. The idea is quite simple: is it plausible that your user visits 60 pages in 60 seconds?

Idea:

Create an array in the user session; it will contain visit time()s.
Remove visits older than X seconds from this array.
Add a new entry for the current visit.
Count the entries in this array.
Ban your user if they visited more than Y pages.

Sample code:

<?php

$visit_counter_pages = 5; // maximum number of pages to load
$visit_counter_secs = 10; // maximum amount of time before cleaning visits

session_start();

// initialize an array for our visit counter
if (array_key_exists('visit_counter', $_SESSION) == false)
{
    $_SESSION['visit_counter'] = array();
}

// clean old visits
foreach ($_SESSION['visit_counter'] as $key => $time)
{
    if ((time() - $time) > $visit_counter_secs) {
        unset($_SESSION['visit_counter'][$key]);
    }
}

// we add the current visit into our array
$_SESSION['visit_counter'][] = time();

// check if user has reached limit of visited pages
$banned = false;
if (count($_SESSION['visit_counter']) > $visit_counter_pages)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether the user is banned.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
$count = count($_SESSION['visit_counter']);
echo "You visited {$count} pages.";
echo str_repeat('<br/>', 2);

echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned! Wait for a short while (10 secs in this demo)...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>
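
The sliding-window logic of the visit counter can be isolated from `$_SESSION` so that it is easy to test. The sketch below is an assumption-laden illustration, not the author's code: the names `recordVisit` and `isOverVisitLimit` are hypothetical, and the arrow function requires PHP 7.4 or later.

```php
<?php

/**
 * Hypothetical helper: keeps only the visits inside the time window,
 * appends the current visit, and returns the updated list.
 */
function recordVisit(array $visits, int $now, int $windowSecs): array
{
    // Same rule as the demo: drop visits older than $windowSecs seconds.
    $recent = array_filter(
        $visits,
        static fn (int $time): bool => ($now - $time) <= $windowSecs
    );

    // Reindex the list and add the current visit.
    $recent = array_values($recent);
    $recent[] = $now;

    return $recent;
}

/** True when the visitor loaded more than $maxPages pages inside the window. */
function isOverVisitLimit(array $visits, int $maxPages): bool
{
    return count($visits) > $maxPages;
}
```

On a real page this could be called as `$_SESSION['visit_counter'] = recordVisit($_SESSION['visit_counter'] ?? [], time(), $visit_counter_secs);` followed by `$banned = isOverVisitLimit($_SESSION['visit_counter'], $visit_counter_pages);`.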

An image to download

When a crawler does its dirty work, it goes after a large amount of data in the shortest possible time. That is why crawlers usually do not download the images on a page: it takes too much bandwidth and slows the crawl down.

This idea (I think the most elegant and the easiest to implement) uses mod_rewrite to hide code behind a .jpg/.png/… image file. The image should be present on each page you want to protect. It could be your website's logo, but choose a small image, because it must not be cached.

Idea:

1/ Add these lines to your .htaccess

RewriteEngine On
RewriteBase /tests/anticrawl/
RewriteRule ^logo\.jpg$ logo.php

2/ Create your logo.php with the security logic

<?php

// start session and reset counter
session_start();
$_SESSION['no_logo_count'] = 0;

// forces image to reload next time
header("Cache-Control: no-store, no-cache, must-revalidate");

// displays image
header("Content-Type: image/jpeg"); // image/jpeg is the registered MIME type for JPEG
readfile("logo.jpg");
die();

3/ Increment no_logo_count on each page that needs protection, and check whether it has reached your limit.

Sample code:

<?php

$no_logo_limit = 5; // number of allowed pages without logo

// start session and initialize
session_start();
if (array_key_exists('no_logo_count', $_SESSION) == false)
{
    $_SESSION['no_logo_count'] = 0;
}
else
{
    $_SESSION['no_logo_count']++;
}

// check if user has reached limit of "undownloaded image"
$banned = false;
if ($_SESSION['no_logo_count'] >= $no_logo_limit)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether the user is banned.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "You did not load the image {$_SESSION['no_logo_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display "show image" link : note that we're using .jpg file
echo <<< EOT

<div id="image_container">
    <a id="image_load" href="#">Load image</a>
</div>
<br/>

<script type="text/javascript">

  // In your implementation, you'll of course use <img src="logo.jpg" />
  $('#image_load').click(function(e) {
    e.preventDefault();
    $('#image_load').html('<img src="logo.jpg" />');
  });

</script>

EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "load image" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>

Cookie check

You can create a cookie on the JavaScript side to check whether your users interpret JavaScript (a crawler using cURL does not, for example).

The idea is quite simple: it is about the same as the image check.

Initialize a $_SESSION counter and increment it on each visit.
If the cookie (set in JavaScript) exists, reset the session counter to 0.
If the counter reaches a limit, ban your user.

Code:

<?php

$no_cookie_limit = 5; // number of allowed pages without the cookie set

// Start session and reset counter
session_start();

if (array_key_exists('cookie_check_count', $_SESSION) == false)
{
    $_SESSION['cookie_check_count'] = 0;
}

// Check the cookie value (note: rename the cookie to a more discreet name, of course)
if ((array_key_exists('cookie_check', $_COOKIE) == false) || ($_COOKIE['cookie_check'] != 42))
{
    // Cookie does not exist or is incorrect...
    $_SESSION['cookie_check_count']++;
}
else
{
    // Cookie is properly set so we reset counter
    $_SESSION['cookie_check_count'] = 0;
}

// Check if user has reached limit of "cookie check"
$banned = false;
if ($_SESSION['cookie_check_count'] >= $no_cookie_limit)
{
    // puts ip of our user on the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether the user is banned.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "Cookie check failed {$_SESSION['cookie_check_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<br/>
<a id="reload" href="#">Reload</a>
<br/>

<script type="text/javascript">

  $('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
  });

</script>

EOT;

// Display "set cookie" link
echo <<< EOT

<br/>
<a id="cookie_link" href="#">Set cookie</a>
<br/>

<script type="text/javascript">

  // On your implementation, you'll of course put the cookie set on a $(document).ready()
  $('#cookie_link').click(function(e) {
    e.preventDefault();
    var expires = new Date();
    expires.setTime(new Date().getTime() + 3600000);
    document.cookie="cookie_check=42;expires=" + expires.toUTCString();
  });

</script>
EOT;


// Display "unset cookie" link
echo <<< EOT

<br/>
<a id="unset_cookie" href="#">Unset cookie</a>
<br/>

<script type="text/javascript">

  // Demo only: expires the cookie so that the check fails again
  $('#unset_cookie').click(function(e) {
    e.preventDefault();
    document.cookie="cookie_check=;expires=Thu, 01 Jan 1970 00:00:01 GMT";
  });

</script>
EOT;

// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "Set cookie" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
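
The counter update in the cookie check can be written as a small pure function. This is a sketch under assumptions, not the author's code: the name `nextCookieCheckCount` is hypothetical, and it swaps the demo's loose `!= 42` test for a strict string comparison, because the loose form also accepts other numeric strings such as "42.0".

```php
<?php

/**
 * Hypothetical helper: returns the next value of the failure counter.
 * The cookie name and expected value mirror the demo ('cookie_check' = 42).
 */
function nextCookieCheckCount(
    int $currentCount,
    array $cookies,
    string $name = 'cookie_check',
    string $expected = '42'
): int {
    // Cookie values arrive as strings, so compare strings strictly.
    $value = $cookies[$name] ?? null;
    if ($value === $expected) {
        return 0; // JavaScript ran and set the cookie: reset the counter.
    }

    return $currentCount + 1; // Missing or wrong cookie: one more failure.
}
```

On a real page this would be used as `$_SESSION['cookie_check_count'] = nextCookieCheckCount($_SESSION['cookie_check_count'] ?? 0, $_COOKIE);`.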

Protection against proxies

A few words about the different kinds of proxies you may find on the web:

A "normal" proxy reveals information about the user's connection (notably, their IP).
An anonymous proxy does not reveal the IP, but its headers indicate that a proxy is in use.
A high-anonymity proxy does not reveal the user's IP and does not send any information that a browser would not send.

It is easy to find a proxy to connect to any website, but it is very hard to find high-anonymity proxies.

Some $_SERVER keys are typically present only when your user is behind a proxy (list taken from another question):

CLIENT_IP
FORWARDED
FORWARDED_FOR
FORWARDED_FOR_IP
HTTP_CLIENT_IP
HTTP_FORWARDED
HTTP_FORWARDED_FOR
HTTP_FORWARDED_FOR_IP
HTTP_PC_REMOTE_ADDR
HTTP_PROXY_CONNECTION
HTTP_VIA
HTTP_X_FORWARDED
HTTP_X_FORWARDED_FOR
HTTP_X_FORWARDED_FOR_IP
HTTP_X_IMFORWARDS
HTTP_XROXY_CONNECTION
VIA
X_FORWARDED
X_FORWARDED_FOR

You may apply different behavior (lower limits, etc.) in your anti-crawl protections if you detect one of these keys in your $_SERVER variable.
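
As a rough illustration of that idea, the sketch below scans a `$_SERVER`-like array for the keys listed above and lowers a limit when one is found. The names `PROXY_KEYS`, `detectProxyKeys`, and `effectiveLimit`, as well as the "halve the limit" policy, are assumptions made for this example. The keys are copied verbatim from the list. Keep in mind that any client can add or omit these headers, and that a site sitting behind its own reverse proxy or load balancer will see some of them on every request, so they are a hint rather than proof.

```php
<?php

/** Keys copied verbatim from the list above. */
const PROXY_KEYS = [
    'CLIENT_IP', 'FORWARDED', 'FORWARDED_FOR', 'FORWARDED_FOR_IP',
    'HTTP_CLIENT_IP', 'HTTP_FORWARDED', 'HTTP_FORWARDED_FOR',
    'HTTP_FORWARDED_FOR_IP', 'HTTP_PC_REMOTE_ADDR', 'HTTP_PROXY_CONNECTION',
    'HTTP_VIA', 'HTTP_X_FORWARDED', 'HTTP_X_FORWARDED_FOR',
    'HTTP_X_FORWARDED_FOR_IP', 'HTTP_X_IMFORWARDS', 'HTTP_XROXY_CONNECTION',
    'VIA', 'X_FORWARDED', 'X_FORWARDED_FOR',
];

/** Returns the proxy-related keys present in a $_SERVER-like array. */
function detectProxyKeys(array $server): array
{
    return array_values(array_intersect(PROXY_KEYS, array_keys($server)));
}

/**
 * Hypothetical policy: halve a limit (never below 1) when any proxy key
 * is present, otherwise return the limit unchanged.
 */
function effectiveLimit(int $baseLimit, array $server): int
{
    if (detectProxyKeys($server) === []) {
        return $baseLimit;
    }

    return max(1, intdiv($baseLimit, 2));
}
```

For example, `$visit_counter_pages = effectiveLimit(5, $_SERVER);` would keep the limit at 5 for direct connections and reduce it to 2 when one of the keys is present.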

Conclusion

There are many ways to detect abuse on your website, so you will surely find a solution. But you need to know precisely how your website is used, so that your protections are not aggressive toward your "normal" users.
