How to get a web page title using an HTML parser
Question
How can I get the title of a web page for a given URL using an HTML parser? Is it possible to get the title using regular expressions? I would prefer to use an HTML parser.
I am working in Java, in the Eclipse IDE.
I have tried the following code, but without success.
Any ideas? Thanks in advance!
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.tags.TitleTag;

public class TestHtml {
    public static void main(String... args) {
        Parser parser = new Parser();
        try {
            parser.setResource("http://www.yahoo.com/");
            NodeList list = parser.parse(null);
            Node node = list.elementAt(0);
            if (node instanceof TitleTag) {
                TitleTag title = (TitleTag) node;
                System.out.println(title.getText());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
Recommended answer
Based on your (revised) question, the problem is that you only check the first node, Node node = list.elementAt(0);, whereas you should iterate over the list to find the title (which is not the first node). Alternatively, you could pass a NodeFilter to parse() so that it returns only the TitleTag; the title would then be the first element of the result and you would not have to iterate.
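The NodeFilter approach might look roughly like the sketch below. It assumes the org.htmlparser library from the question is on the classpath and uses its NodeClassFilter to keep only TitleTag nodes. The class name TitleFilterExample and the helpers findTitle and titleOfHtml are hypothetical names introduced for illustration. It also reads the title with getTitle() rather than getText(), on the assumption that getText() on a tag returns the tag's own markup text instead of its contents; check this against the version of the library you use.

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TitleFilterExample {

    // Hypothetical helper: returns the text of the first <title> tag
    // the parser finds, or null if the page has none.
    static String findTitle(Parser parser) throws ParserException {
        // The filter is applied to nested nodes too, so the TitleTag is
        // collected even though it sits inside <html> and <head>.
        NodeList titles = parser.parse(new NodeClassFilter(TitleTag.class));
        if (titles.size() == 0) {
            return null;
        }
        TitleTag title = (TitleTag) titles.elementAt(0);
        return title.getTitle();
    }

    // Hypothetical offline helper: parses an HTML string instead of a URL,
    // which is convenient for trying the filter without network access.
    static String titleOfHtml(String html) {
        try {
            return findTitle(Parser.createParser(html, "UTF-8"));
        } catch (ParserException e) {
            throw new IllegalStateException("Could not parse HTML", e);
        }
    }

    public static void main(String... args) {
        try {
            Parser parser = new Parser();
            parser.setResource("http://www.yahoo.com/");
            System.out.println(findTitle(parser));
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
```

Because the filter only lets TitleTag nodes through, the size check is the only guard needed before reading element 0; a page without a title simply yields null.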