Looking for a simple Java spider
Question
I need to supply a base URL (such as http://www.wired.com) and need to spider through the entire site, outputting an array of pages (off the base URL). Is there any library that would do the trick?
Thanks.
Answer
I have used Web Harvest a couple of times, and it is quite good for web scraping.
Web-Harvest is an open-source Web data extraction tool written in Java. It offers a way to collect the desired Web pages and extract useful data from them. To do that, it leverages well-established techniques and technologies for text/XML manipulation such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML/XML-based web sites, which still make up the vast majority of Web content. On the other hand, it can easily be supplemented by custom Java libraries to augment its extraction capabilities.
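Web-Harvest is driven by an XML configuration file rather than Java code. As a rough sketch (the element names come from Web-Harvest's processor vocabulary; the URL, variable name, and XPath expression here are illustrative, not from the original answer), a configuration that fetches a page, cleans it into XML, and pulls out its links looks something like this:

```xml
<!-- Sketch of a Web-Harvest configuration:
     fetch a page, convert it to well-formed XML,
     then collect all hyperlink targets into a variable. -->
<config charset="UTF-8">
    <var-def name="urlList">
        <xpath expression="//a/@href">
            <html-to-xml>
                <http url="http://www.wired.com"/>
            </html-to-xml>
        </xpath>
    </var-def>
</config>
```

The pipeline reads inside-out: `http` downloads the page, `html-to-xml` tidies it, and `xpath` extracts the matching nodes, which `var-def` stores for later processors.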
Alternatively, you can roll your own web scraper using tools such as JTidy to first convert an HTML document to XHTML, and then process the information you need with XPath. For example, a very naïve XPath expression to extract all hyperlinks from http://www.wired.com would be something like //a[contains(@href,'wired')]/@href. You can find some sample code for this approach in this answer to a similar question.
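To make the XPath step concrete: assuming JTidy has already produced a well-formed XHTML DOM (JTidy's `Tidy.parseDOM` returns a standard `org.w3c.dom.Document`), the expression above can be evaluated with the JDK's built-in `javax.xml.xpath` API. The sketch below skips the JTidy step and parses a small made-up XHTML string directly, so it is self-contained; the class and method names are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class LinkExtractor {

    // Evaluate the answer's XPath expression against well-formed XHTML.
    // In a real spider, the Document would come from JTidy's parseDOM
    // applied to the raw HTML fetched from the site.
    static List<String> extractWiredLinks(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xhtml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//a[contains(@href,'wired')]/@href",
                doc, XPathConstants.NODESET);
        List<String> links = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            links.add(nodes.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample page: one matching link, one non-matching.
        String sample = "<html><body>"
                + "<a href='http://www.wired.com/science'>science</a>"
                + "<a href='http://example.com/other'>other</a>"
                + "</body></html>";
        System.out.println(extractWiredLinks(sample));
        // prints [http://www.wired.com/science]
    }
}
```

Note that this only works on well-formed markup, which is exactly why JTidy (or a similar HTML cleaner) is needed in front of it for real-world pages.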