How to get a web page title using an HTML parser


Problem description

How can I get the title of a web page for a given URL using an HTML parser? Is it possible to get the title with regular expressions? I would prefer to use an HTML parser.

I am working in Java, using the Eclipse IDE.

I tried the following code, but it did not work.

Any ideas?

Thanks in advance!

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TestHtml {

    public static void main(String... args) {
        Parser parser = new Parser();
        try {
            parser.setResource("http://www.yahoo.com/");
            NodeList list = parser.parse(null);
            Node node = list.elementAt(0);

            if (node instanceof TitleTag) {
                TitleTag title = (TitleTag) node;
                System.out.println(title.getText());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

Recommended answer

Based on your (revised) question, the problem is that you only check the first node, Node node = list.elementAt(0);, whereas you should iterate over the list to find the title, which is not the first node. Alternatively, you could pass a NodeFilter to parse() so that it returns only the TitleTag; the title would then be the first element of the result, and you would not have to iterate.
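The same idea (look for the title wherever it occurs in the document, not only at the first node) can be sketched without any third-party jar. The example below is not the org.htmlparser API from the question; it swaps in the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator) so that it runs as-is. The class name TitleExtractor, the method extractTitle, and the sample HTML are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: uses the JDK's built-in HTML parser, not org.htmlparser.
public class TitleExtractor {

    /** Collects the text found between <title> and </title>, wherever the tag occurs. */
    private static final class TitleCallback extends HTMLEditorKit.ParserCallback {
        private final StringBuilder title = new StringBuilder();
        private boolean inTitle = false;
        private boolean done = false; // set once the first title has been closed

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TITLE && !done) {
                inTitle = true;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (inTitle) {
                title.append(data);
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TITLE && inTitle) {
                inTitle = false;
                done = true;
            }
        }
    }

    /** Returns the trimmed title text, or an empty string if no title text was seen. */
    public static String extractTitle(Reader html) throws IOException {
        TitleCallback callback = new TitleCallback();
        // The third argument (true) tells the parser to ignore charset declarations in the page.
        new ParserDelegator().parse(html, callback, true);
        return callback.title.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Sample markup: the title is deliberately not the first node in the document.
        String html = "<html><head><meta name=\"author\" content=\"x\">"
                + "<title>Example Page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractTitle(new StringReader(html)));
    }
}
```

To read a live page, the Reader could wrap the stream returned by a URL connection instead of a StringReader. With org.htmlparser itself, the filter-based variant described above would, as far as I know, pass a filter such as a NodeClassFilter built from TitleTag.class to parse(); that is worth checking against the library's Javadoc before relying on it.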
