Goutte won't load an ASP SSL page

Question

I am trying out Goutte, the PHP web scraper built on Symfony2 components. I've successfully retrieved Google over both plain HTTP and SSL. However, I've come across an ASP/SSL page that won't load.

Here is my code:

// Load a crawler/browser system
require_once 'vendor/goutte/goutte.phar';

// Here's a demo of a page we want to parse
$uri = '(removed)';

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";

For this one site, however, the echo at the end of the above code prints this instead of the page text:


Bad Request (Invalid Header Name)

I can see the site fine in Firefox, and its HTML can be retrieved using wget --no-check-certificate with no other options (no custom headers or user agent, for example).

I suspect I need to set some HTTP headers in Goutte. Does anyone have any ideas which ones I should try?

Recommended answer

I discovered that my browser and wget both send a non-empty User-Agent header, so I am assuming Goutte sets nothing here. Adding this header to the client object prior to the fetch fixes the problem:

// Load a crawler/browser system
require_once 'vendor/goutte/goutte.phar';

// Here's a demo of a page we want to parse
$uri = '(removed)';

use Goutte\Client;

// Set up headers
$client = new Client();
$headers = array(
    'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:21.0) Gecko/20100101 Firefox/21.0',
);
foreach ($headers as $header => $value)
{
    $client->setHeader($header, $value);
}

$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";

Here I've copied in my browser's user agent string, but in this case I think any value would work, as long as the header is set.

Incidentally, I used a browser UA here because I was trying to replicate the browser environment accurately while debugging this particular problem. Once it worked I switched to a custom UA, so that target sites can detect it as a bot if they wish to (for this project I don't think anyone has).
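
As a rough illustration of the custom UA mentioned above, here is a minimal sketch. The helper buildBotUserAgent() is hypothetical (it is not part of Goutte), and the crawler name, version and URL are placeholders. The "name/version (+URL)" shape is just a common convention for bot user agents, not something the target site is known to require.

```php
<?php
// Hypothetical helper, not part of Goutte: builds a User-Agent string that
// identifies the crawler and, optionally, a URL where site owners can read
// more about it.
function buildBotUserAgent(string $name, string $version, ?string $infoUrl = null): string
{
    $agent = $name . '/' . $version;

    // Append the info URL only when one was actually supplied.
    if ($infoUrl !== null && $infoUrl !== '') {
        $agent .= ' (+' . $infoUrl . ')';
    }

    return $agent;
}

// Placeholder values; substitute your own project name and URL.
$userAgent = buildBotUserAgent('ExampleCrawler', '1.0', 'https://example.com/bot');
echo $userAgent . "\n";

// With the $client object from the code above, the header would then be set
// the same way as before:
// $client->setHeader('User-Agent', $userAgent);
```

The only part that matters for the original error is that the resulting string is non-empty; the rest is a courtesy to the sites being crawled.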
