What is the most elegant way to do screen scraping in node.js?


Question

I'm in the process of hacking together a web app which uses extensive screen scraping in node.js. I feel like I'm fighting against the current at every corner. There must be an easier way to do this. Most notably, two things are irritating:

  1. Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish.

  2. Redirect following. I want each request to follow through redirects when a 302 status code is returned.
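
To make the two pain points concrete, here is a minimal sketch of doing both by hand with only Node's built-in `http`, `https`, and `URL` modules. The helper names (`storeCookies`, `cookieHeader`, `redirectTarget`, `fetchPage`) are hypothetical and do not come from any library mentioned here. The sketch also assumes a single flat cookie jar with no domain, path, or expiry scoping, and it treats every redirect as a plain GET, so it illustrates the chores rather than replacing a real cookie jar.

```javascript
'use strict';
const http = require('http');
const https = require('https');

// Fold a response's 'set-cookie' array into a plain name -> value jar.
// Only the leading "name=value" pair is kept; attributes such as Path,
// Expires, or HttpOnly are ignored in this sketch.
function storeCookies(jar, setCookie) {
  for (const line of setCookie || []) {
    const pair = line.split(';')[0];
    const eq = pair.indexOf('=');
    if (eq > 0) {
      jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
    }
  }
  return jar;
}

// Serialize the jar into a value for the request's Cookie header.
function cookieHeader(jar) {
  return Object.keys(jar).map((name) => `${name}=${jar[name]}`).join('; ');
}

// Return the absolute URL to follow for a redirect response, or null.
function redirectTarget(currentUrl, statusCode, headers) {
  const isRedirect = [301, 302, 303, 307, 308].includes(statusCode);
  if (!isRedirect || !headers.location) return null;
  // Location may be relative, so resolve it against the current URL.
  return new URL(headers.location, currentUrl).toString();
}

// GET a URL, carrying cookies along and following up to maxRedirects hops.
function fetchPage(url, jar = {}, maxRedirects = 5) {
  return new Promise((resolve, reject) => {
    const client = url.startsWith('https:') ? https : http;
    const headers = {};
    const cookies = cookieHeader(jar);
    if (cookies) headers.Cookie = cookies;
    client.get(url, { headers }, (res) => {
      storeCookies(jar, res.headers['set-cookie']);
      const next = redirectTarget(url, res.statusCode, res.headers);
      if (next) {
        res.resume(); // discard the redirect body
        if (maxRedirects <= 0) return reject(new Error('Too many redirects'));
        return resolve(fetchPage(next, jar, maxRedirects - 1));
      }
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve({ url, status: res.statusCode, body, jar }));
      res.on('error', reject);
    }).on('error', reject);
  });
}
```

Calling `fetchPage('https://example.com/login')` (a placeholder URL) would resolve with the final page's body and the accumulated jar, which can be passed to the next `fetchPage` call to keep the session going.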

I came across two things which looked useful, but which I couldn't use in the end:

  • http://zombie.labnotes.org/, but it doesn't have HTTPS support, so I can't use it.

  • http://www.phantomjs.org/, but I couldn't use it because it doesn't (appear to) integrate with node.js. It's also pretty heavyweight for what I'm doing.

Are there any JavaScript screenscraping-esque libraries which propagate cookies, follow redirects, and support HTTPS? Any pointers on how to make this easier?

Recommended answer

I actually have a scraper library now, https://github.com/mikeal/spider. It's quite nice: you can use jQuery and routes.

Feedback welcome :)
