Navigating / scraping hashbang links with JavaScript (PhantomJS)

Question

I'm trying to download the HTML of a website that is almost entirely generated by JavaScript. So, I need to simulate browser access and have been playing around with PhantomJS. Problem is, the site uses hashbang URLs and I can't seem to get PhantomJS to process the hashbang -- it just keeps calling up the homepage.

The site is http://www.regulations.gov. The default takes you to #!home. I've tried using the following code (from here) to try and process different hashbangs.

if (phantom.state.length === 0) {
     if (phantom.args.length === 0) {
        console.log('Usage: loadreg_1.js <some hash>');
        phantom.exit();
     }
     var address = 'http://www.regulations.gov/';
     console.log(address);
     phantom.state = Date.now().toString();
     phantom.open(address);

} else {
     var hash = phantom.args[0];
     document.location = hash;
     console.log(document.location.hash);
     var elapsed = Date.now() - new Date().setTime(phantom.state);
     if (phantom.loadStatus === 'success') {
             if (!first_time) {
                     var first_time = true;
                     if (!document.addEventListener) {
                             console.log('Not SUPPORTED!');
                     }
                     phantom.render('result.png');
                     var markup = document.documentElement.innerHTML;
                     console.log(markup);
                     phantom.exit();
             }
     } else {
             console.log('FAIL to load the address');
             phantom.exit();
     }
}

This code produces the correct hashbang (for instance, I can set the hash to '#!contactus'), but it doesn't dynamically generate any different HTML, just the default page. It does, however, correctly output that hash when I call document.location.hash.

I've also tried to set the initial address to the hashbang, but then the script just hangs and doesn't do anything. For example, if I set the url to http://www.regulations.gov/#!searchResults;rpp=10;po=0 the script just hangs after printing the address to the terminal and nothing ever happens.

Recommended answer

The issue here is that the content of the page loads asynchronously, but you're expecting it to be available as soon as the page is loaded.

In order to scrape a page that loads content asynchronously, you need to wait to scrape until the content you're interested in has been loaded. Depending on the page, there might be different ways of checking, but the easiest is just to check at regular intervals for something you expect to see, until you find it.
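The "check at regular intervals" idea can be written as a small reusable helper. This is only a sketch: the name `waitFor`, its argument order, and the attempt limit are my own assumptions, not part of PhantomJS. The attempt limit is there so the script gives up instead of hanging if the content never shows up.

```javascript
// Generic polling helper (a sketch; the function name, argument order,
// and attempt limit are assumptions, not a PhantomJS API).
function waitFor(check, onReady, onGiveUp, intervalMs, maxAttempts) {
    var attempts = 0;
    var timerId = setInterval(function () {
        attempts += 1;
        var result = check();
        if (result) {
            // Stop polling before handing control to the caller.
            clearInterval(timerId);
            onReady(result);
        } else if (attempts >= maxAttempts) {
            // Give up rather than letting the script hang forever.
            clearInterval(timerId);
            onGiveUp(attempts);
        }
    }, intervalMs);
    return timerId;
}

// Usage: simulate content that "appears" on the third check.
var checks = 0;
waitFor(
    function () { checks += 1; return checks >= 3 ? 'Contact Us' : null; },
    function (title) { console.log('Ready: ' + title); },
    function (n) { console.log('Gave up after ' + n + ' checks'); },
    10,
    20
);
```

In a real script, `check` would inspect the page's DOM, and `onReady` would do the rendering and exit.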

The trick here is figuring out what to look for - you need something that won't be present on the page until your desired content has been loaded. In this case, the easiest option I found for top-level pages is to manually input the H1 tags you expect to see on each page, keying them to the hash:

var titleMap = {
    '#!contactUs': 'Contact Us',
    '#!aboutUs': 'About Us'
    // etc for the other pages
};

Then in your success block, you can set a recurring timeout to look for the title you want in an h1 tag. When it shows up, you know you can render the page:

if (phantom.loadStatus === 'success') {
    // poll every 300 milliseconds until the expected title appears
    var intervalId = window.setInterval(function () {
        // check for the title element you expect to see
        var h1s = document.querySelectorAll('h1');
        if (h1s.length === 0) {
            console.log('No H1 tags found.');
            return;
        }
        // h1s is a NodeList, not an array, hence the
        // Array.prototype call here
        var found = Array.prototype.some.call(h1s, function (h1) {
            return h1.textContent.trim() === titleMap[hash];
        });
        if (!found) {
            console.log('Found H1 tags, but not ' + titleMap[hash]);
            return;
        }
        // we found it! stop the cycle, then render and exit
        window.clearInterval(intervalId);
        console.log('Found H1: ' + titleMap[hash]);
        phantom.render('result.png');
        console.log('Rendered image.');
        phantom.exit();
    }, 300);
}

The above code works for me. But it won't work if you need to scrape search results - you'll need to figure out an identifying element or bit of text that you can look for without having to know the title ahead of time.
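For pages like search results, one option is to build the readiness check from a CSS selector instead of a known title. The sketch below is an assumption-heavy illustration: `makeSelectorCheck` is a hypothetical helper, and `.search-result` is a made-up selector, since I have not checked the site's actual markup. A stub object stands in for `document` so the snippet runs outside a browser.

```javascript
// Build a readiness check from a CSS selector instead of a known title.
// `doc` is anything with querySelectorAll (the page's `document` in practice).
// The helper name and the selector used below are hypothetical.
function makeSelectorCheck(doc, selector, minCount) {
    var needed = minCount || 1;
    return function () {
        var nodes = doc.querySelectorAll(selector);
        // Return the matches when enough are present, otherwise null.
        return nodes.length >= needed ? nodes : null;
    };
}

// Usage with a stub document that "contains" two result rows.
var stubDocument = {
    querySelectorAll: function (selector) {
        return selector === '.search-result' ? ['row 1', 'row 2'] : [];
    }
};
var resultsReady = makeSelectorCheck(stubDocument, '.search-result');
console.log(resultsReady() ? 'Results are present' : 'Still waiting');
```

The returned function can be called on each tick of the polling interval in place of the H1 comparison.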

Edit: Also, it looks like the newest version of PhantomJS now triggers an onResourceReceived event when it gets new data. I haven't looked into this, but you might be able to bind a listener to this event to achieve the same effect.
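I have not tested this against the site, but one way to use that event is to count requests that are still in flight and treat the page as settled once the count drops to zero. The tracker below is a hypothetical helper of my own, not a PhantomJS API. The commented wiring assumes the WebPage callbacks `onResourceRequested` and `onResourceReceived`, where each request carries an `id` and each response carries a `stage` of `'start'` or `'end'`.

```javascript
// Tracks in-flight requests by id (a sketch; the helper is hypothetical).
function createResourceTracker() {
    var pending = {};
    var count = 0;
    return {
        requested: function (id) {
            if (!pending[id]) { pending[id] = true; count += 1; }
        },
        received: function (id, stage) {
            // Only the final chunk of a response completes the request.
            if (stage === 'end' && pending[id]) { delete pending[id]; count -= 1; }
        },
        pendingCount: function () { return count; }
    };
}

// Assumed wiring inside a PhantomJS script (not run here):
//   page.onResourceRequested = function (requestData) {
//       tracker.requested(requestData.id);
//   };
//   page.onResourceReceived = function (response) {
//       tracker.received(response.id, response.stage);
//   };

// Simulated event sequence so the sketch runs on its own.
var tracker = createResourceTracker();
tracker.requested(1);
tracker.requested(2);
tracker.received(1, 'start');
tracker.received(1, 'end');
console.log('Pending requests: ' + tracker.pendingCount());
```

An idle network does not guarantee that the page's scripts have finished updating the DOM, so this probably works best combined with a content check rather than as a replacement for it.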
