Question about parsing HTML using Regex and Java


Question

I have a question about finding HTML tags using Java and regex.

I am using the code below to find all the tags in an HTML document; documentURL holds the HTML content.

The find() method returns true, meaning that it can find something in the HTML, but the matches() method always returns false, and I am completely and utterly puzzled by this.

I referred to the Java documentation too but could not find my answer.

What is the correct way of using Matcher?

    Pattern keyLineContents = Pattern.compile("(<.*?>)");
    Matcher keyLineMatcher = keyLineContents.matcher(documentURL);
    boolean result = keyLineMatcher.find();
    boolean matchFound = keyLineMatcher.matches();

Doing something like this throws an exception:

     String abc = keyLineMatcher.group(0);

Thanks.

Recommended answer

The correct way to loop through matches is:

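import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("<.*?>");
Matcher m = p.matcher(htmlString);
// Each call to find() advances to the next match; group() returns its text.
while (m.find()) {
  System.out.println(m.group());
}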
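As for why the code in the question behaves the way it does, a likely explanation is the difference between the two methods: find() looks for the next substring that matches the pattern, while matches() succeeds only if the entire input matches it. A failed matches() call also leaves the matcher without a current match, so a following group(0) throws IllegalStateException. The sketch below illustrates this; the class name MatcherDemo, its method names, and the sample input are made up for illustration and are not from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample input are assumptions.
class MatcherDemo {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Collects every substring matching the pattern by calling find() in a loop.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    // matches() is true only when the whole input matches the pattern.
    static boolean matchesWholeInput(String html) {
        return TAG.matcher(html).matches();
    }

    // Repeats the question's call sequence: find(), then matches(), then group(0).
    // Returns true if group(0) throws IllegalStateException.
    static boolean groupThrowsAfterFailedMatches(String html) {
        Matcher m = TAG.matcher(html);
        m.find();    // succeeds if the input contains a tag
        m.matches(); // fails unless the whole input matches, discarding the match state
        try {
            m.group(0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String html = "Hello <b>world</b>!";
        System.out.println(findTags(html));                      // [<b>, </b>]
        System.out.println(matchesWholeInput(html));             // false
        System.out.println(groupThrowsAfterFailedMatches(html)); // true
    }
}
```

In short, call group() only after a find() or matches() call that returned true, and use find() in a loop when the goal is to extract substrings.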

That being said, regular expressions are an extremely poor way to parse HTML. The reason comes down to this: regular expressions work well for recognizing regular languages, but HTML is not a regular language. Its arbitrarily nested structure needs at least a context-free grammar. Regular expressions fall down on things like nested tags, a > inside an attribute value, and so on.
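To make the attribute-value problem concrete, here is a small sketch that runs the same `<.*?>` pattern over an anchor tag whose title attribute contains a `>`. The class name RegexHtmlPitfall and the sample markup are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample markup is an assumption.
class RegexHtmlPitfall {
    private static final Pattern TAG = Pattern.compile("<.*?>");

    // Returns every substring the pattern treats as a tag.
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // The '>' inside the quoted attribute value ends the match too early.
        String html = "<a title=\"a > b\">link</a>";
        System.out.println(findTags(html)); // [<a title="a >, </a>]
    }
}
```

The opening tag is cut off at the first `>`, so the reported "tag" is not a real tag. Patching the pattern to skip quoted strings handles this one case, but comments, scripts, and nesting each need further special handling.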

Use a dedicated HTML parser instead, such as HTML Parser.
