Looking for a simple Java spider


Question


I need to supply a base URL (such as http://www.wired.com) and need to spider through the entire site outputting an array of pages (off the base URL). Is there any library that would do the trick?

Thanks.

Answer


I have used Web Harvest a couple of times, and it is quite good for web scraping.


Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of Web content. On the other hand, it can easily be supplemented by custom Java libraries in order to augment its extraction capabilities.
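To give a sense of what working with Web-Harvest looks like, here is a minimal configuration sketch that downloads a page, cleans the HTML to well-formed XML, and pulls out all hyperlinks. This is written in the style of Web-Harvest's XML configuration language; the exact element names and attributes should be checked against the documentation of the version you install.

```xml
<config charset="UTF-8">
    <!-- Fetch the page, convert its (possibly messy) HTML to
         well-formed XML, then select every href attribute. -->
    <var-def name="links">
        <xpath expression="//a/@href">
            <html-to-xml>
                <http url="http://www.wired.com"/>
            </html-to-xml>
        </xpath>
    </var-def>
</config>
```

A configuration like this can be run interactively in Web-Harvest's GUI, or loaded and executed from Java code; the embedding API differs between versions, so consult the manual for the classes to use.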


Alternatively, you can roll your own web scraper using tools such as JTidy to first convert an HTML document to XHTML, and then process the information you need with XPath. For example, a very naïve XPath expression to extract all hyperlinks from http://www.wired.com would be something like //a[contains(@href,'wired')]/@href. You can find some sample code for this approach in this answer to a similar question.
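Assuming the JTidy step has already produced well-formed XHTML, the XPath part of that approach can be sketched with just the JDK's built-in XML APIs. The class name `LinkExtractor` and the sample markup below are illustrative, not taken from any library:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class LinkExtractor {

    // Evaluate an XPath expression against well-formed XHTML and
    // return the matched attribute values (e.g. href strings).
    public static List<String> extractLinks(String xhtml, String xpathExpr)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpathExpr, doc, XPathConstants.NODESET);
        List<String> links = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            links.add(nodes.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href='http://www.wired.com/news'>news</a>"
                + "<a href='http://example.com/'>other</a>"
                + "</body></html>";
        // The same naïve expression as above: keep links containing 'wired'.
        List<String> links =
                extractLinks(page, "//a[contains(@href,'wired')]/@href");
        System.out.println(links); // prints [http://www.wired.com/news]
    }
}
```

In a real spider, the string passed to `extractLinks` would be the output of JTidy's HTML-to-XHTML cleanup rather than hand-written markup, and the extracted URLs would be fed back into a queue of pages to fetch.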

