php crawl - javascript enabled


Question

Bonjour, does anyone know of a way to create a spider that acts as if it has JavaScript enabled?

PHP code:

file_get_contents("http://www.google.co.uk/search?hl=en&q=".urlencode($keyword)."&start=".($x*10)."&sa=N")

This would retrieve the output of that page. If you used this PHP code instead:

file_get_contents("http://www.facebook.com/something/something.something.php") 
(I'm not sure; I just know Facebook is a good example.)

it would return the output, which I'm guessing would include something along the lines of "you must have javascript enabled to continue", because it is a JavaScript-driven site (not accessible).

PHP code (just checked):

$link = "http://www.facebook.com/index.php";
$contents = file_get_contents($link);
echo $contents;

Returns: You are using an incompatible web browser.

Sorry, we're not cool enough to support your browser. Please keep it real with one of the following browsers:

* Mozilla Firefox
* Safari
* Microsoft Internet Explorer

which I tested through all of the above browsers?

Recommended answer

Apparently, in this specific case, Facebook is only testing the HTTP header "User-Agent".

If I use this portion of code, based on curl, which lets me set a lot of options using curl_setopt:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.facebook.com/index.php");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
curl_close($ch);
echo $html;

I get the same message as you do.


But if I send a User-Agent that corresponds to Firefox (I just copy-pasted the one my real Firefox actually sends):

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.facebook.com/index.php");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090910 Ubuntu/9.04 (jaunty) Shiretoko/3.5.3");
$html = curl_exec($ch);
curl_close($ch);
echo $html;

I get the real Facebook homepage, and not the error message about an incompatible browser.
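Since the question used file_get_contents rather than curl, here is a minimal sketch of how the same User-Agent header could be sent through a stream context. The helper names build_browser_context and fetch_as_browser are hypothetical, as is the 10-second default timeout; the http context options themselves (user_agent, follow_location, timeout) are standard PHP.

```php
<?php
// Sketch only: send a browser-like User-Agent with file_get_contents().
// build_browser_context() and fetch_as_browser() are hypothetical helper names.

function build_browser_context(string $userAgent, int $timeoutSeconds = 10)
{
    // Options under the "http" key are also used for https:// URLs.
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,      // sent as the User-Agent header
            'follow_location' => 1,               // follow redirects, like CURLOPT_FOLLOWLOCATION
            'timeout'         => $timeoutSeconds, // read timeout in seconds
        ],
    ]);
}

function fetch_as_browser(string $url, string $userAgent)
{
    // Returns the response body as a string, or false on failure.
    return @file_get_contents($url, false, build_browser_context($userAgent));
}
```

Calling fetch_as_browser("http://www.facebook.com/index.php", $ua) with the Firefox string from the curl example should behave like that example, provided the site really only checks the User-Agent header.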


Of course, this will not solve the problem of JavaScript not being executed...

... But executing JavaScript without a browser is quite a difficult thing (not even Google has solved it ^^ )
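To make the limitation concrete: what curl or file_get_contents returns is the raw HTML, with its scripts present but never run. The sketch below, which assumes the DOM extension is available, counts the script elements in a fetched page and collects any noscript fallback text, which gives a rough idea of how much the page depends on JavaScript. The function name summarize_script_usage and the array keys it returns are hypothetical.

```php
<?php
// Sketch only: inspect fetched HTML for signs that the page relies on JavaScript.
// summarize_script_usage() and its result keys are hypothetical names.

function summarize_script_usage(string $html): array
{
    $summary = [
        'external_scripts'  => [], // src values of <script src="...">
        'inline_scripts'    => 0,  // number of <script> elements without src
        'noscript_messages' => [], // text shown to clients without JavaScript
    ];
    if (trim($html) === '') {
        return $summary; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('script') as $script) {
        $src = $script->getAttribute('src');
        if ($src !== '') {
            $summary['external_scripts'][] = $src;
        } else {
            $summary['inline_scripts']++;
        }
    }
    foreach ($doc->getElementsByTagName('noscript') as $node) {
        $summary['noscript_messages'][] = trim($node->textContent);
    }
    return $summary;
}
```

None of the scripts found this way are executed; the function only reports that they exist, which is exactly the gap described above.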

There are engines that can run JavaScript code without a browser (Rhino, for instance, or the SpiderMonkey PECL extension for PHP); but even if they let you run JavaScript code, you will not have all the environment and methods that the browser provides and on which websites rely...


One idea, if you need to crawl a JavaScript-dependent website, might be to use Selenium, which opens a real browser (i.e., Firefox or another) that you control from your PHP code via Selenium RC.

But that means you must have a graphical environment, and a browser, on your PHP machine; this is also quite heavy and slow, a lot slower than just loading a web page ^^
