Parse meta tags in Java
Question
I have a collection of HTML documents for which I need to parse the contents of the <meta> tags in the <head> section. These are the only HTML tags whose values I'm interested in, i.e. I don't need to parse anything in the <body> section.
I've attempted to parse these values using the XPath support provided by JDom. However, this isn't working out too well because a lot of the HTML in the <body> section is not valid XML.
Does anyone have any suggestions for how I might go about parsing these tag values in a manner that can deal with malformed HTML?
Cheers, Don
Recommended answer
You can likely use the Jericho HTML Parser. In particular, have a look at this to see how you can go about finding specific tags.
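To make the idea concrete (use a lenient parser and look only at the <meta> tags), here is a minimal sketch. Note that it does not use Jericho: it swaps in the JDK's built-in javax.swing.text.html parser so that it runs without any extra dependency, and the same approach should carry over to Jericho's own tag-lookup calls. The class name MetaTagExtractor, the method extractMeta, and the sample HTML are made up for illustration, and keying the result by the name (or http-equiv) attribute is an assumption about which values you want.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Jericho or JDom.
class MetaTagExtractor {

    // Returns the content of each <meta> tag, keyed by its name or http-equiv attribute.
    static Map<String, String> extractMeta(String html) {
        Map<String, String> result = new LinkedHashMap<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <meta> has no closing tag, so it is reported as a "simple" tag.
                if (tag != HTML.Tag.META) {
                    return;
                }
                Object key = attrs.getAttribute(HTML.Attribute.NAME);
                if (key == null) {
                    key = attrs.getAttribute(HTML.Attribute.HTTPEQUIV);
                }
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (key != null && content != null) {
                    result.put(key.toString(), content.toString());
                }
            }
        };

        try {
            // 'true' tells the parser to ignore charset declarations instead of
            // throwing ChangedCharSetException when it meets one in a <meta> tag.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input with a deliberately broken <body>.
        String html = "<html><head>"
                + "<meta name=\"author\" content=\"Don\">"
                + "<meta name=\"description\" content=\"demo page\">"
                + "</head><body><p>unclosed <b>bold<td>stray cell</body></html>";
        extractMeta(html).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Because this parser is error-tolerant HTML rather than XML, the unclosed and misplaced tags in the body do not stop it from reporting the <meta> tags, which is the same property that makes Jericho a better fit here than XPath over JDom.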