Which web crawler for extracting and parsing data from about a thousand web sites

Question

I'm trying to crawl about a thousand web sites, and I'm interested only in their HTML content.

I then transform the HTML into XML so that it can be parsed with XPath to extract the specific content I'm interested in.
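To make that extraction step concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the pipeline actually used here: the class `HtmlToTree`, the helper `extract`, and the sample page are all hypothetical, and it assumes a single root element. `xml.etree.ElementTree` supports only a subset of XPath, and this converter handles just unclosed and void tags, so a real crawl would more likely rely on a dedicated HTML tidying library.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HtmlToTree(HTMLParser):
    """Hypothetical helper: build a well-formed element tree from HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of elements still waiting to be closed

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive as None.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything left open inside the element being closed.
        while self.open_tags:
            current = self.open_tags.pop()
            self.builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def finish(self):
        self.close()  # flush any buffered input
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def extract(html, path):
    """Return the stripped text of every element matching `path`."""
    parser = HtmlToTree()
    parser.feed(html)
    root = parser.finish()
    return ["".join(el.itertext()).strip() for el in root.findall(path)]


page = """<html><body>
<h1>News</h1>
<div class="article"><img src="a.png"><p>First item</p></div>
<div class="article"><p>Second item</div>
</body></html>"""

titles = extract(page, ".//div[@class='article']/p")
print(titles)  # ['First item', 'Second item']
```

The second `<p>` in the sample is never closed, which is the kind of markup that breaks a strict XML parser. The converter closes it when it meets the enclosing `</div>`.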

I've been using the Heritrix 2.0 crawler for a few months, but I ran into huge performance, memory and stability problems (Heritrix crashes about once a day, and none of my attempts to limit memory usage with JVM parameters were successful).

From your experience in the field, which crawler would you use for extracting and parsing content from a thousand sources?

Recommended answer

I would not use the 2.x branch (which has been discontinued) or the 3.x branch (currently in development) for any 'serious' crawling, unless you want to help improve Heritrix or just like being on the bleeding edge.

Heritrix 1.14.3 is the most recent stable release, and it really is stable: many institutions use it for both small and large scale crawling. I'm using it to run crawls against tens of thousands of domains, collecting tens of millions of URLs in under a week.

The 3.x branch is getting closer to a stable release, but even then I'd wait a while, so that general use at The Internet Archive and elsewhere can improve its performance and stability.

Update: Since someone up-voted this recently, I feel it is worth noting that Heritrix 3.x is now stable and is the recommended version for those starting out with Heritrix.
