Scraping attempts getting 403 error


Problem Description

I am trying to scrape a website and I am getting a 403 Forbidden error no matter what I try:

  1. wget
  2. cURL (command line and PHP)
  3. Perl WWW::Mechanize
  4. PhantomJS

I tried all of the above with and without proxies, changing the user agent, and adding a referrer header.

I even copied the request headers from my Chrome browser and tried sending them with my PHP cURL request, but I am still getting a 403 Forbidden error.

Any input or suggestions on what is triggering the website to block the request, and how to bypass it?

PHP cURL example:

$url = 'https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=1510475982858';
$headers = array(
    'accept:application/json, text/javascript, */*; q=0.01',
    'accept-encoding:gzip, deflate, br',
    'accept-language:en-US,en;q=0.9',
    'referer:https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands:quadblock:supplements',
    'user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'x-requested-with:XMLHttpRequest',
);

$res = curl_get($url, $headers);
print $res;
exit;

function curl_get($url, $headers = array(), $useragent = '') {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the response instead of printing it
    curl_setopt($curl, CURLOPT_HEADER, true);          // include response headers so they can be split off below
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_ENCODING, '');          // accept any encoding cURL supports and decode it
    if ($useragent) curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
    if ($headers)   curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);

    $response = curl_exec($curl);

    // Split the headers from the body; only the body is returned.
    $header_size = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
    $header   = substr($response, 0, $header_size);
    $response = substr($response, $header_size);

    curl_close($curl);
    return $response;
}

And here is the response I always get:

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access     

  "http&#58;&#47;&#47;www&#46;vitacost&#46;com&#47;productResults&#46;aspx&#63;" 
on this server.<P>
Reference&#32;&#35;18&#46;55f50717&#46;1510477424&#46;2a24bbad
</BODY>
</HTML>

Recommended Answer

First, note that the site does not like web scraping. As @KeepCalmAndCarryOn pointed out in a comment, the site has a /robots.txt where it explicitly asks bots not to crawl specific parts of the site, including the parts you want to scrape. While not legally binding, a good citizen will adhere to such a request.
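As a quick way to honor that request, a scraper can check each URL against the rules before fetching it. A minimal sketch with Python's standard-library parser (the `Disallow` rule below is illustrative, not the site's actual robots.txt):

```python
import urllib.robotparser

# Hypothetical robots.txt content; fetch the real one from
# https://www.vitacost.com/robots.txt in practice.
robots_txt = """\
User-agent: *
Disallow: /productResults.aspx
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

url = "https://www.vitacost.com/productResults.aspx?allCategories=true"
print(rp.can_fetch("MyScraper/1.0", url))  # False: the path is disallowed
```

`can_fetch` matches the URL's path against the rules for the given user agent, so a polite scraper can skip disallowed paths entirely.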

Additionally, the site seems to employ explicit protection against scraping and tries to make sure the client is really a browser. The site appears to be behind the Akamai CDN, so the anti-scraping protection may come from this CDN.
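One hint that the block comes from the CDN edge rather than the origin server is the `Reference #...` line in the denial page shown earlier, a pattern typical of Akamai error pages. A heuristic sketch (the detection rule is an assumption, not an official API):

```python
import html
import re

# The HTML-entity-encoded denial page body from the question.
body = ("<HTML><HEAD><TITLE>Access Denied</TITLE></HEAD><BODY>"
        "<H1>Access Denied</H1>"
        "Reference&#32;&#35;18&#46;55f50717&#46;1510477424&#46;2a24bbad"
        "</BODY></HTML>")

# Decode the entities, then look for the "Reference #..." signature.
decoded = html.unescape(body)
is_cdn_block = bool(re.search(r"Reference #\d+\.[0-9a-f]+", decoded))
print(is_cdn_block)  # True
```

This only identifies where the block likely originates; it does not help bypass it, but it tells you that tweaking origin-side parameters is unlikely to matter.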

But I took the request sent by Firefox (which worked) and then tried to simplify it as much as possible. The following currently works for me, but may of course fail if the site updates its browser detection:

use strict;
use warnings;
use IO::Socket::SSL;

# Build the raw request; normalize line endings to the CRLF that HTTP requires.
(my $rq = <<'RQ') =~s{\r?\n}{\r\n}g;
GET /productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=151047598285 HTTP/1.1
Host: www.vitacost.com
Accept: */*
Accept-Language: en-US
Connection: keep-alive

RQ

my $cl = IO::Socket::SSL->new('www.vitacost.com:443') or die;
print $cl $rq;

# Read the response headers up to the blank line.
my $hdr = '';
while (<$cl>) {
    $hdr .= $_;
    last if $_ eq "\r\n";
}
warn "[header done]\n";

# Read exactly Content-Length bytes of body and print them.
my $len = $hdr =~m{^Content-length:\s*(\d+)}mi && $1 or die "no length";
read($cl,my $buf,$len);
print $buf;

Interestingly, if I remove the Accept header I get a 403 Forbidden. If I instead remove Accept-Language, it simply hangs. And, also interestingly, it does not seem to need a User-Agent header.
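The same minimal request can be sketched in Python for comparison; the header set mirrors the Perl script above, and the actual network send is left commented out so the snippet stands alone (uncomment it to try against the live site):

```python
import socket
import ssl

HOST = "www.vitacost.com"
PATH = ("/productResults.aspx?allCategories=true&N=1318723"
        "&isrc=vitacostbrands%3aquadblock%3asupplements"
        "&scrolling=true&No=40&_=151047598285")

# Only the headers the experiment showed to be required.
request = (
    f"GET {PATH} HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    "Accept: */*\r\n"
    "Accept-Language: en-US\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)

# To actually send it over TLS:
# ctx = ssl.create_default_context()
# with socket.create_connection((HOST, 443)) as sock:
#     with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
#         tls.sendall(request.encode("ascii"))
#         print(tls.recv(65536).decode("latin-1", "replace"))

print(request)
```

Since the detection heuristics may change at any time, treat this as a snapshot of what happened to work, not a stable recipe.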

It looks like the bot detection also uses the sender's source IP as a feature. While the code above works for me from two different systems, it fails to work on a third system (hosted at DigitalOcean) and just hangs.
