How to scrape links with PhantomJS
Problem description
Can PhantomJS be used as an alternative to BeautifulSoup?
I am trying to search on Etsy and visit all the links in turn. In Python, I know how to do this (with BeautifulSoup), but today I want to see if I can do the same with PhantomJS. I'm not getting very far.
This script should search for "hello kitty" on Etsy, return all of the product links
<a class="listing-thumb" href=...></a>
and print them in the console. Ideally I'd visit them later on and get the information I need. Right now it just freezes. Any ideas?
var page = require('webpage').create();
var url = 'http://www.etsy.com/search?q=hello%20kitty';
page.open(url, function(status){
    // list all the a.href links in the hello kitty etsy page
    var link = page.evaluate(function() {
        return document.querySelectorAll('a.listing-thumb');
    });
    for(var i = 0; i < link.length; i++){ console.log(link[i].href); }
    phantom.exit();
});
I have toyed with using CasperJS, which may be better designed for this.
Recommended answer
PhantomJS's page.evaluate() cannot serialize and return complex objects such as HTMLElements or NodeLists, so you have to map them to serializable values first:
var page = require('webpage').create();
var url = 'http://www.etsy.com/search?q=hello%20kitty';
page.open(url, function(status) {
    // list all the a.href links in the hello kitty etsy page
    var links = page.evaluate(function() {
        return [].map.call(document.querySelectorAll('a.listing-thumb'), function(link) {
            return link.getAttribute('href');
        });
    });
    console.log(links.join('\n'));
    phantom.exit();
});
Note: here we use [].map.call() in order to treat a NodeList as a standard Array.
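The mapping trick can be tried outside PhantomJS as well. The following is only a minimal sketch under stated assumptions: fakeAnchor and fakeNodeList are hypothetical stand-ins for DOM anchors and a NodeList, and the JSON round trip is a rough approximation of what happens to a value returned from evaluate(), not PhantomJS's actual mechanism.

```javascript
// Hypothetical stand-in for an <a> element: it only knows its href attribute.
function fakeAnchor(href) {
    return {
        getAttribute: function(name) { return name === 'href' ? href : null; }
    };
}

// Hypothetical stand-in for a NodeList: indexed entries and a length,
// but none of the Array methods.
var fakeNodeList = { 0: fakeAnchor('/listing/1'), 1: fakeAnchor('/listing/2'), length: 2 };

// Borrow Array.prototype.map so the array-like object can be mapped to plain strings.
var hrefs = [].map.call(fakeNodeList, function(link) {
    return link.getAttribute('href');
});

// Assumption: a JSON round trip roughly mimics passing a value out of the page.
var elementsAfter = JSON.parse(JSON.stringify(fakeNodeList)); // methods are dropped
var hrefsAfter = JSON.parse(JSON.stringify(hrefs));           // plain strings survive

console.log(typeof elementsAfter[0].getAttribute);
console.log(hrefsAfter.join('\n'));
```

In this sketch the element-like objects lose their getAttribute method after the round trip, while the array of href strings comes back unchanged, which is why the links are mapped to strings before being returned.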