Java HTML Parsing

Question

I'm working on an app that scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for

div class = "classname"

in each line of HTML. This works, but I can't help feeling there is a better solution out there.

Is there a nice way to give a class a line of HTML and get some convenient methods like:

boolean usesClass(String CSSClassname);
String getText();
String getLink();

Solution

Several years ago I used JTidy for the same purpose:

http://jtidy.sourceforge.net/

"JTidy is a Java port of HTML Tidy, a HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively makes you able to use JTidy as a DOM parser for real-world HTML.

JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.

More information on JTidy can be found on the JTidy SourceForge project page."
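
As a rough illustration of the DOM approach, here is a minimal sketch of the three methods from the question wrapped around an org.w3c.dom.Element. Some assumptions: the DivScraper and Block names are made up for this example, and the input is already well-formed XHTML, so the JDK's built-in XML parser stands in for JTidy here. With JTidy, the Document would presumably come from Tidy.parseDOM(InputStream, OutputStream) instead, and its DOM implementation may not support every call used below (getTextContent in particular is worth checking).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DivScraper {

    /** Hypothetical wrapper exposing the three methods asked for in the question. */
    static final class Block {
        private final Element element;

        Block(Element element) {
            this.element = element;
        }

        /** True if the element's class attribute lists the given class name. */
        boolean usesClass(String cssClassName) {
            String attr = element.getAttribute("class").trim();
            return !attr.isEmpty()
                    && Arrays.asList(attr.split("\\s+")).contains(cssClassName);
        }

        /** Text of the element and all of its descendants, trimmed. */
        String getText() {
            return element.getTextContent().trim();
        }

        /** href of the first <a> descendant, or null if there is none. */
        String getLink() {
            NodeList anchors = element.getElementsByTagName("a");
            if (anchors.getLength() == 0) {
                return null;
            }
            return ((Element) anchors.item(0)).getAttribute("href");
        }
    }

    /** Parses well-formed XHTML with the JDK parser (a stand-in for JTidy). */
    static Document parse(String xhtml) throws Exception {
        byte[] bytes = xhtml.getBytes(StandardCharsets.UTF_8);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
    }

    /** Collects every div in the document that uses the given CSS class. */
    static List<Block> divsWithClass(Document doc, String cssClassName) {
        List<Block> result = new ArrayList<>();
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Block block = new Block((Element) divs.item(i));
            if (block.usesClass(cssClassName)) {
                result.add(block);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page; real-world HTML would need cleaning first.
        String page = "<html><body>"
                + "<div class=\"classname big\"><a href=\"/first\">First</a></div>"
                + "<div class=\"other\">Skipped</div>"
                + "<div class=\"classname\">No link here</div>"
                + "</body></html>";
        for (Block b : divsWithClass(parse(page), "classname")) {
            System.out.println(b.getText() + " -> " + b.getLink());
        }
    }
}
```

Unlike checking each line for the literal attribute text, this matches the class name as a whole token, so it also handles divs that span several lines or carry more than one class.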
