Question about parsing HTML using Regex and Java
Problem description
I have a question about finding HTML tags using Java and regex.
I am using the code below to find all the tags in an HTML document. Despite its name, documentURL holds the HTML content itself.
The find() method returns true, meaning that it can find something in the HTML, but the matches() method always returns false, and I am completely puzzled by this.
I referred to the Java documentation too but could not find an answer.
What is the correct way of using Matcher?
Pattern keyLineContents = Pattern.compile("(<.*?>)");
Matcher keyLineMatcher = keyLineContents.matcher(documentURL);
boolean result = keyLineMatcher.find();
boolean matchFound = keyLineMatcher.matches();
Doing something like this throws an exception:
String abc = keyLineMatcher.group(0);
Thanks.
Recommended answer
The correct way to loop through matches is:
Pattern p = Pattern.compile("<.*?>");
Matcher m = p.matcher(htmlString);
while (m.find()) {
System.out.println(m.group());
}
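As for why the original code behaves the way it does: find() looks for the next substring that matches the pattern, while matches() succeeds only if the entire input matches. A call to group() after a failed match attempt throws IllegalStateException, which is most likely the exception seen in the question. The sketch below illustrates the difference; the class name TagFinder, its helper methods, and the sample input are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TagFinder {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // find() scans forward for the next matching substring,
    // so looping over it collects every tag-like match in order.
    public static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    // matches() is true only when the whole input is one match.
    public static boolean isWholeInputOneTag(String html) {
        return TAG.matcher(html).matches();
    }

    // Reproduces the sequence from the question: find(), then matches(),
    // then group(). Returns true if group() throws IllegalStateException.
    public static boolean groupThrowsAfterFailedMatches(String html) {
        Matcher m = TAG.matcher(html);
        m.find();
        m.matches(); // fails when the input is more than a single tag
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String html = "Hello <b>world</b>!";
        System.out.println(findTags(html));                      // [<b>, </b>]
        System.out.println(isWholeInputOneTag(html));            // false
        System.out.println(isWholeInputOneTag("<b>"));           // true
        System.out.println(groupThrowsAfterFailedMatches(html)); // true
    }
}
```

In short, call group() only right after a find() or matches() call that returned true.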
That being said, regular expressions are an extremely poor method of parsing HTML. The reason comes down to this: regular expressions work well for parsing regular languages, whereas HTML is a context-free language. Regular expressions fall down on things like nested tags, a > character inside an attribute value, and so on.
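To make the attribute-value problem concrete, here is a small sketch that runs the same pattern over a tag whose attribute contains a > character. The class name AttributeGotcha and the sample markup are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class AttributeGotcha {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Returns every substring the pattern treats as a "tag".
    public static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // The opening tag has a '>' inside its title attribute.
        String html = "<a title=\"x > y\">link</a>";
        for (String tag : findTags(html)) {
            System.out.println(tag);
        }
        // prints:
        // <a title="x >
        // </a>
    }
}
```

The lazy quantifier stops at the first > it sees, so the opening tag is cut off in the middle of the attribute value. The pattern has no notion of quoting, which is one of the things a real HTML parser handles.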
Use a dedicated HTML parser instead, such as HTML Parser.