How do I use the Java library "HTML Parser" to remove all <style> tags?

Question

I need to perform several actions on an HTML file, such as removing a specific tag or deleting attributes. I decided to use HTML Parser, a Java library: http://htmlparser.sourceforge.net/

First of all, I want to remove all the style tags. I managed to get a NodeList containing all the style tags by doing this:

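// Parse the document at url and collect every <style> tag, searching recursively.
Parser parser = new Parser(url);
NodeList list = parser.parse(null); // a null filter returns all top-level nodes
NodeList styles = list.extractAllNodesThatMatch(new TagNameFilter("STYLE"), true);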

Now I don't know how to remove these style tags from the whole list of nodes. Do I have to go through the whole list myself?

After that, I want to be able to delete all the attributes inside the tags, or delete only the alt attributes, for example. Is there a method that does this automatically?

Recommended answer

According to the documentation, the Parser returns a list of trees that contains all of your HTML's nodes. Think of the parser as the root node of a big tree of Node objects, where each "level" of that tree is a NodeList.

You can walk the tree recursively, test whether each node is a StyleTag, and remove it from the NodeList that contains it when it is. Keep descending into the tree until you have visited all of its nodes.

NodeTreeWalker is your friend here and can help you with the recursive tree traversal.
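
The recursive removal described above could look roughly like the sketch below. It uses hand-written recursion rather than NodeTreeWalker, and it parses an in-memory string with Parser.createParser instead of a URL so that it is self-contained. The class name StyleStripper and the method names removeStyleTags and stripStyles are hypothetical names chosen for illustration, and the code assumes the HTML Parser jar is on the classpath.

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.StyleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper class, not part of the HTML Parser library.
public class StyleStripper {

    /** Recursively removes every StyleTag from the list and from all descendants. */
    static void removeStyleTags(NodeList nodes) {
        // Walk backwards so that removing an element does not shift
        // the indices of the elements still to be visited.
        for (int i = nodes.size() - 1; i >= 0; i--) {
            Node node = nodes.elementAt(i);
            if (node instanceof StyleTag) {
                nodes.remove(i);
            } else {
                NodeList children = node.getChildren(); // null for leaf nodes
                if (children != null) {
                    removeStyleTags(children);
                }
            }
        }
    }

    /** Parses an HTML string, strips the style tags, and serializes the result. */
    static String stripStyles(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList list = parser.parse(null); // null filter: all top-level nodes
        removeStyleTags(list);
        return list.toHtml();
    }

    public static void main(String[] args) throws ParserException {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(stripStyles(html));
    }
}
```

Because the removal happens on the NodeList that directly contains each StyleTag, the change is reflected when the top-level list is serialized with toHtml(). The list returned by extractAllNodesThatMatch in the question is a separate collection, so removing entries from it would not alter the parsed tree.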

jsoup is another nice alternative with a simpler interface (see this other question).
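
For comparison, here is a minimal sketch of the same cleanup with jsoup, assuming the jsoup jar is on the classpath. It removes every style element and, as an example of the attribute removal asked about in the question, strips the alt attribute wherever it appears. The class name JsoupCleaner and the method name clean are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, not part of jsoup.
public class JsoupCleaner {

    /** Removes all style elements and all alt attributes from the given HTML. */
    static String clean(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("style").remove();          // drop every <style> element
        doc.select("[alt]").removeAttr("alt"); // drop alt from any element that has it
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><img src=\"a.png\" alt=\"logo\"><p>Hello</p></body></html>";
        System.out.println(clean(html));
    }
}
```

Note that jsoup normalizes the document when it serializes it, so the output may differ from the input in whitespace and in implied tags, not only in the removed parts.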
