Is it possible to write a web crawler in JavaScript?


Problem description

I want to crawl a page, find the hyperlinks on that page, follow those hyperlinks, and capture data from the pages they lead to.

Recommended answer

Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy.
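
For instance, here is a minimal sketch of that restriction in practice (the target URL is hypothetical): a cross-origin fetch from ordinary page script fails unless the other server opts in via CORS.

```javascript
// Hypothetical cross-origin request from ordinary page JavaScript.
// Unless https://other-site.example responds with an Access-Control-Allow-Origin
// header that permits this page's origin, the browser withholds the response.
fetch('https://other-site.example/some-page.html')
  .then(response => response.text())
  .then(html => console.log('Fetched', html.length, 'characters'))
  .catch(error => console.error('Blocked by the Same-Origin Policy:', error));
```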

If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some edge case applies, e.g., the Access-Control-Allow-Origin header is set for pages on the other server).
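
As a rough sketch of what such a same-origin crawler could look like (the starting path, page limit, and data-capture step are illustrative placeholders, not part of the original answer):

```javascript
// Minimal same-origin crawler sketch: fetch a page, collect its links,
// and follow only the links that share the current page's origin.
async function crawl(startPath, maxPages = 50) {
  const visited = new Set();
  const queue = [new URL(startPath, location.origin).href];

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    const response = await fetch(url);  // same origin, so the Same-Origin Policy allows it
    const html = await response.text();
    const doc = new DOMParser().parseFromString(html, 'text/html');

    // Capture whatever data you need from `doc` here (placeholder).
    console.log(url, doc.title);

    for (const anchor of doc.querySelectorAll('a[href]')) {
      const link = new URL(anchor.getAttribute('href'), url);
      if (link.origin === location.origin) {
        queue.push(link.href.split('#')[0]);  // drop fragments to avoid revisiting the same page
      }
    }
  }
}

// Example usage, assuming the script already runs on a page of the site being crawled:
crawl('/');
```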

If you really want to write a fully-featured crawler in browser JS, you could write a browser extension: for example, Chrome extensions are packaged Web applications that run with special permissions, including cross-origin Ajax. The difficulty with this approach is that you'll have to write multiple versions of the crawler if you want to support multiple browsers. (If the crawler is just for personal use, that's probably not an issue.)
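
For illustration only, here is a minimal sketch of the permission side of that idea in today's Chrome Manifest V3 format (the extension name and the crawler.js filename are made up): declaring broad host permissions is what allows the extension's own script to make cross-origin requests with fetch.

```json
{
  "manifest_version": 3,
  "name": "Personal crawler (example)",
  "version": "1.0",
  "host_permissions": ["<all_urls>"],
  "background": { "service_worker": "crawler.js" }
}
```

With a manifest like this, the extension's background script can fetch pages from any origin; the browser no longer blocks those requests under the Same-Origin Policy.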
