What is the most elegant way to do screen scraping in node.js?
Question
I'm in the process of hacking together a web app that does extensive screen scraping in node.js. I feel like I'm fighting against the current at every turn. There must be an easier way to do this. Two things are most irritating:
Cookie propagation. I can pull the 'set-cookie' array out of the response headers, but performing string operations to parse the cookies out of the array feels extremely hackish.
Redirect following. I want each request to follow the redirect when a 302 status code is returned.
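To make the two pain points concrete, here is a minimal sketch of both using only modules built into a reasonably recent Node.js runtime. The helper names (`storeCookies`, `cookieHeader`, `nextLocation`, `get`) are hypothetical and not taken from any library. The cookie handling is deliberately naive: it ignores attributes such as Path, Domain, and Expires, and shares one jar across all hosts.

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: copy name=value pairs from a 'set-cookie' header
// array into a plain object. Cookie attributes are ignored in this sketch.
function storeCookies(jar, setCookieHeaders) {
  for (const header of setCookieHeaders || []) {
    const pair = header.split(';')[0];   // "name=value" always comes first
    const eq = pair.indexOf('=');
    if (eq === -1) continue;             // skip malformed entries
    jar[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
  return jar;
}

// Hypothetical helper: build the value of the outgoing 'Cookie' header.
function cookieHeader(jar) {
  return Object.entries(jar).map(([k, v]) => `${k}=${v}`).join('; ');
}

// Hypothetical helper: return the absolute redirect target, or null when
// the response is not a redirect. Relative Location values are resolved
// against the URL that was just requested.
function nextLocation(currentUrl, statusCode, headers) {
  const redirectCodes = [301, 302, 303, 307, 308];
  if (!redirectCodes.includes(statusCode) || !headers.location) return null;
  return new URL(headers.location, currentUrl).toString();
}

// Hypothetical helper: GET a URL over HTTP or HTTPS, sending and collecting
// cookies along the way and following at most `redirectsLeft` redirects.
// Once the limit is reached, the redirect response itself is returned.
function get(url, jar, redirectsLeft, callback) {
  const client = url.startsWith('https:') ? https : http;
  const options = { headers: {} };
  const cookies = cookieHeader(jar);
  if (cookies) options.headers.Cookie = cookies;

  client.get(url, options, (res) => {
    storeCookies(jar, res.headers['set-cookie']);
    const next = nextLocation(url, res.statusCode, res.headers);
    if (next && redirectsLeft > 0) {
      res.resume();                      // discard the redirect body
      return get(next, jar, redirectsLeft - 1, callback);
    }
    let body = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => callback(null, res, body));
  }).on('error', callback);
}
```

A call would look like `get('https://example.com/login', {}, 5, (err, res, body) => { ... })`, where the URL is a placeholder. A scraper that logs in to real sites would probably want a proper cookie jar that honours Domain, Path, and expiry rather than this flat object.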
I came across two things that looked useful, but in the end I couldn't use either:
http://zombie.labnotes.org/, but it doesn't have HTTPS support, so I can't use it.
http://www.phantomjs.org/, but I couldn't use it because it doesn't (appear to) integrate with node.js. It's also pretty heavyweight for what I'm doing.
Are there any JavaScript screen-scraping libraries that propagate cookies, follow redirects, and support HTTPS? Any pointers on how to make this easier?
Recommended answer
I actually have a scraper library now: https://github.com/mikeal/spider. It's quite nice; you can use jQuery and routes.
Feedback welcome :)