How to scrape links with PhantomJS
Can PhantomJS be used as an alternative to BeautifulSoup?
I am trying to search on Etsy and visit all the links in turn. In Python, I know how to do this (with BeautifulSoup), but today I want to see if I can do the same with PhantomJS. I'm not getting very far.
This script should search for "hello kitty" on Etsy, return all of the product links of the form
<a class="listing-thumb" href=...></a>
and print them in the console. Ideally I'd visit them later on and get the information I need. Right now it just freezes. Any ideas?
var page = require('webpage').create();
var url = 'http://www.etsy.com/search?q=hello%20kitty';
page.open(url, function(status){
    // list all the a.href links in the hello kitty etsy page
    var link = page.evaluate(function() {
        return document.querySelectorAll('a.listing-thumb');
    });
    for(var i = 0; i < link.length; i++){ console.log(link[i].href); }
    phantom.exit();
});
I have toyed with using CasperJS, which may be better designed for this.
PhantomJS's evaluate() cannot serialize and return complex objects such as HTMLElements or NodeLists, so you have to map them to something serializable first:
var page = require('webpage').create();
var url = 'http://www.etsy.com/search?q=hello%20kitty';
page.open(url, function(status) {
    // list all the a.href links in the hello kitty etsy page
    var links = page.evaluate(function() {
        // map each DOM node to a plain string so the result can be serialized
        return [].map.call(document.querySelectorAll('a.listing-thumb'), function(link) {
            return link.getAttribute('href');
        });
    });
    // print one link per line
    console.log(links.join('\n'));
    phantom.exit();
});
Note: here we use [].map.call() in order to treat a NodeList as a standard Array.
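To see why the mapping step matters, here is a minimal sketch that runs in plain JavaScript without PhantomJS. It rests on some assumptions: `fakeNodeList` and `makeLink` are hypothetical stand-ins for a real NodeList and its elements, and the JSON round trip only approximates what happens when a value crosses the evaluate() boundary. It is not PhantomJS's actual mechanism.

```javascript
// Hypothetical stand-in for a DOM element: it only supports getAttribute('href').
function makeLink(href) {
  return {
    getAttribute: function (name) {
      return name === 'href' ? href : null;
    }
  };
}

// Hypothetical stand-in for a NodeList: indexed entries plus a length,
// but none of the Array methods such as map().
var fakeNodeList = {
  0: makeLink('/listing/1'),
  1: makeLink('/listing/2'),
  length: 2
};

// Same technique as in the answer: borrow Array.prototype.map and apply it
// to the array-like object, producing a real Array of plain strings.
function collectHrefs(nodes) {
  return [].map.call(nodes, function (link) {
    return link.getAttribute('href');
  });
}

var hrefs = collectHrefs(fakeNodeList);

// Plain strings come through a JSON round trip unchanged...
var safe = JSON.parse(JSON.stringify(hrefs));

// ...whereas the element-like objects lose their methods along the way.
var lossy = JSON.parse(JSON.stringify(fakeNodeList));

console.log(safe.join('\n'));
console.log(typeof lossy[0].getAttribute);
```

The same pattern extends to more than one field per link: return an array of plain objects (for example, each holding an href and a title string) rather than the elements themselves.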