Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)


Question

What are the best options for scraping a page that is not currently open in a tab, from within a Google Chrome extension, using JavaScript and whatever other technologies are available? Other JavaScript libraries are also acceptable.

The important thing is to mask the scraping so that it behaves like a normal web request: there should be no indications of AJAX or XMLHttpRequest, such as the X-Requested-With: XMLHttpRequest or Origin headers.

The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.

Are there any hooks in any WebKit/Chrome-specific APIs that can be used to make a normal web request and get the results for manipulation?

var pageContent = getPageContent(url); // TODO: Implement
var items = $(pageContent).find('.item');
// Display items with further selections

Bonus points for making this work from a local file on disk, for initial debugging. But if that is the only point stopping a solution, then disregard the bonus points.

Solution

If you are open to looking beyond a Google Chrome plugin, take a look at PhantomJS, which uses Qt-WebKit in the background and runs just like a browser, including making AJAX requests. You can call it a headless browser, as it doesn't display its output on a screen and can quietly work in the background while you are doing other things. If you want, you can export images or PDFs of the pages it fetches. It provides a JS interface for loading pages, clicking buttons and so on, much as you would in a browser. You can also inject custom JS, for example jQuery, into any of the pages you want to scrape and use it to access the DOM and export the desired data. As it uses WebKit, its rendering behaviour is much like Google Chrome's.
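
As a rough illustration of the PhantomJS approach described above, a minimal sketch might look like the following. It assumes PhantomJS 1.6 or later (so that arguments can be passed to page.evaluate) and is meant to be run as "phantomjs scrape.js". The URL, the '.item' selector, the file name and the helper names extractItems and scrape are placeholders chosen for this example, not part of the original answer.

```javascript
// Runs inside the loaded page via page.evaluate, so it can only use
// browser globals and its own arguments (no variables from this script).
function extractItems(selector) {
  var nodes = document.querySelectorAll(selector);
  var texts = [];
  for (var i = 0; i < nodes.length; i++) {
    texts.push(nodes[i].textContent.trim());
  }
  return texts;
}

// Hypothetical helper: loads `url` in `page` and hands the extracted
// strings (or an Error) to the `done` callback.
function scrape(page, url, selector, done) {
  page.open(url, function (status) {
    if (status !== 'success') {
      done(new Error('Failed to load ' + url), null);
      return;
    }
    done(null, page.evaluate(extractItems, selector));
  });
}

// Only start scraping when actually running under PhantomJS.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  // Placeholder URL and selector; replace with the real target.
  scrape(page, 'http://example.com/', '.item', function (err, items) {
    console.log(err ? err.message : JSON.stringify(items));
    phantom.exit(err ? 1 : 0);
  });
}
```

The extraction function is kept separate because PhantomJS serialises it and runs it in the page's own context; only JSON-compatible values, such as the array of strings here, come back to the controlling script.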

Another option would be to use Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can be used as a simple scraping tool as well.
