How to scrape links with PhantomJS

Problem description

Can PhantomJS be used as an alternative to BeautifulSoup?

I am trying to search on Etsy and visit each of the resulting links in turn. In Python, I know how to do this (with BeautifulSoup), but today I want to see if I can do the same with PhantomJS. I'm not getting very far.

This script should search for "hello kitty" on Etsy, return all of the product links (<a class="listing-thumb" href=...></a>), and print them in the console. Ideally I'd visit them later on and get the information I need. Right now it just freezes. Any ideas?

var page = require('webpage').create();
var url = 'http://www.etsy.com/search?q=hello%20kitty';

page.open(url, function(status){
    // list all the a.href links in the hello kitty etsy page
    var link = page.evaluate(function() {
        return document.querySelectorAll('a.listing-thumb');
    });
    for(var i = 0; i < link.length; i++){ console.log(link[i].href); }
    phantom.exit();
});

I have toyed with using CasperJS, which may be better designed for this.

Solution

PhantomJS's evaluate() cannot serialize and return complex objects such as HTMLElements or NodeLists, so you have to map them to something serializable before returning them:

var page = require('webpage').create();
var url = 'http://www.etsy.com/search?q=hello%20kitty';

page.open(url, function(status) {
    // list all the a.href links in the hello kitty etsy page
    var links = page.evaluate(function() {
        return [].map.call(document.querySelectorAll('a.listing-thumb'), function(link) {
            return link.getAttribute('href');
        });
    });
    console.log(links.join('\n'));
    phantom.exit();
});
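
The serialization constraint can be illustrated outside PhantomJS. The sketch below is only an approximation: it assumes that a JSON round trip is a fair stand-in for what crosses the boundary between the page and the script (the PhantomJS documentation gives "if it can be serialized via JSON, then it is fine" as a rule of thumb). The names simulateEvaluate and makeFakeLink are hypothetical helpers, not part of any PhantomJS API, and real DOM nodes may behave differently in detail.

```javascript
// Hypothetical stand-in for the page boundary: PhantomJS is NOT used here.
// A JSON round trip approximates "only serializable values get through".
function simulateEvaluate(fn) {
    var result = fn();
    var json = JSON.stringify(result);
    return json === undefined ? undefined : JSON.parse(json);
}

// Fake "element": like a DOM node, its useful parts are methods, which JSON drops.
function makeFakeLink(href) {
    return {
        getAttribute: function (name) {
            return name === 'href' ? href : null;
        }
    };
}

var fakeLinks = [makeFakeLink('/listing/1'), makeFakeLink('/listing/2')];

// Returning the elements themselves: their methods do not survive the crossing.
var rawResult = simulateEvaluate(function () {
    return fakeLinks;
});

// Mapping to strings first: plain data crosses intact.
var hrefResult = simulateEvaluate(function () {
    return fakeLinks.map(function (link) {
        return link.getAttribute('href');
    });
});

console.log(typeof rawResult[0].getAttribute); // "undefined"
console.log(hrefResult.join('\n'));
```

In this simulation the raw objects arrive as empty shells, while the array of href strings arrives unchanged, which is why the solution extracts the attribute inside evaluate().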

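Note: the PhantomJS script above uses [].map.call() because a NodeList is array-like but does not provide map() itself; borrowing the method from Array lets us treat the NodeList as a standard Array.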
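
As a rough illustration of the [].map.call() idiom, the following sketch runs in plain JavaScript without a browser. It assumes a hand-made array-like object (fakeNodeList, a hypothetical stand-in for what querySelectorAll returns) and a hypothetical helper named mapArrayLike; neither comes from the original answer.

```javascript
// Hypothetical stand-in for a NodeList: indexed entries plus a length, but no map().
var fakeNodeList = {
    0: { href: '/listing/1' },
    1: { href: '/listing/2' },
    length: 2
};

// Borrow Array.prototype.map and run it with the array-like object as `this`.
function mapArrayLike(arrayLike, fn) {
    return [].map.call(arrayLike, fn);
}

var hrefs = mapArrayLike(fakeNodeList, function (item) {
    return item.href;
});

console.log(typeof fakeNodeList.map); // "undefined": the array-like has no map() of its own
console.log(Array.isArray(hrefs));    // true: the result is a real Array
console.log(hrefs.join('\n'));
```

The borrowed map() only relies on numeric indices and a length property, so it works on any array-like object and always returns a genuine Array.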
