Screen scraping a JS page


Problem description

I'm trying to scrape this page http://www.buddytv.com/trivia/game-of-thrones-trivia.aspx and it's not working.

I tried

  $html = new simple_html_dom();
  $html->load_file($url);

But the question I'm looking to grab (.trivia-question) can't be found. Can anybody tell me what I'm doing wrong?

Thanks a lot!

And I tried

  <?php
  $Page = file_get_contents('http://www.buddytv.com/trivia/game-of-thrones-trivia.aspx');
  $dom_document = new DOMDocument();
  // Suppress errors, because loadHTML emits warnings on mismatched HTML tags
  @$dom_document->loadHTML($Page);
  $dom_xpath = new DOMXPath($dom_document);
  $elements = $dom_xpath->query('//*[@id="id60questionText"]');
  var_dump($elements);

Solution

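The question text is most likely inserted by JavaScript after the page loads, so parsers that only see the raw HTML (simple_html_dom, DOMDocument) will not find it. You need something that actually runs the page's scripts. Here is a PhantomJS example: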

You need to download PhantomJS from http://phantomjs.org/ and put it somewhere your script can easily access.

Test it by running {installationdir}/bin/phantomjs --version (phantomjs.exe on Windows).

Then create a JS file somewhere in your project, e.g. browser.js:

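var page = require('webpage').create();

page.open('http://www.buddytv.com/trivia/game-of-thrones-trivia.aspx', function() {

  // Inject jQuery so it can be used inside page.evaluate()
  page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {

    // This function runs inside the loaded page, not in the PhantomJS script scope
    var search = page.evaluate(function() {
      return $('#id60questionText').text();
    });

    // Print the result to stdout so the calling PHP script can read it
    console.log(search);

    phantom.exit();
  });
});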
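
If you would rather not inject jQuery, the same lookup can probably be done with the page's own DOM API. The following is only a sketch under a few assumptions: `readText` is a name made up for this example, the page exposes `document.querySelector`, and the element is already present when the load callback fires (a page that fills it in later would need some waiting logic). It also checks the load status, which the script above skips.

```javascript
// Hypothetical helper. It runs inside the page, so it must not rely on
// variables from the outer script: PhantomJS serializes the function
// before executing it in the page context.
function readText(selector) {
  var el = document.querySelector(selector);
  if (!el) {
    return null; // element not found (or not rendered yet)
  }
  // Trim surrounding whitespace without relying on newer string methods
  return el.textContent.replace(/^\s+|\s+$/g, '');
}

// Only start the browser when actually running under PhantomJS
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();

  page.open('http://www.buddytv.com/trivia/game-of-thrones-trivia.aspx', function (status) {
    if (status !== 'success') {
      console.log('Failed to load page');
      phantom.exit(1);
      return;
    }

    // Extra arguments to page.evaluate() are passed through to the function
    console.log(page.evaluate(readText, '#id60questionText'));
    phantom.exit(0);
  });
}
```

A `null` result here means the selector matched nothing at the moment of the call, which is a useful hint that the content is rendered later than the load event.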

Then in your PHP script, read the output like this:

$pathToPhatomJs = '/home/aurimas/Downloads/phantomjs/phantomjs-1.9.1-linux-x86_64/bin/phantomjs';

$pathToJsScript = '/home/aurimas/Downloads/phantomjs/phantomjs-1.9.1-linux-x86_64/browser.js';

$stdOut = exec(sprintf('%s %s', $pathToPhatomJs,  $pathToJsScript), $out);

echo $stdOut;

Change $pathToPhatomJs and $pathToJsScript according to your configuration.
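
One thing to keep in mind: PHP's exec() returns only the last line of the command's output, so a scraped text containing line breaks would come back truncated. A possible way around that, sketched here with a made-up helper name (`toOutputLine`) and a made-up envelope format, is to have browser.js print a single line of JSON instead of the raw text:

```javascript
// Hypothetical helper: wraps the scraped text in a one-line JSON envelope.
// JSON.stringify escapes line breaks inside strings, so the output always
// fits on a single line.
function toOutputLine(questionText) {
  var found = typeof questionText === 'string' && questionText.length > 0;
  return JSON.stringify({
    found: found,
    question: found ? questionText : null
  });
}

// In browser.js this would take the place of console.log(search):
//   console.log(toOutputLine(search));
```

On the PHP side, json_decode($stdOut, true) would then give you an array with the `found` and `question` keys.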

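If you are on Windows, reading the output through exec() this way may not work. In that case, change the PHP script to redirect the output to a file and read that file instead: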

$pathToPhatomJs = '/home/aurimas/Downloads/phantomjs/phantomjs-1.9.1-linux-x86_64/bin/phantomjs';

$pathToJsScript = '/home/aurimas/Downloads/phantomjs/phantomjs-1.9.1-linux-x86_64/browser.js';

exec(sprintf('%s %s > phatom.txt', $pathToPhatomJs,  $pathToJsScript), $out);

$fileContents = file_get_contents('phatom.txt');

echo $fileContents;
