Parsing HTML with C++ (using Qt preferably)


Question

I'm trying to parse some HTML with C++ to extract all URLs from it (the URLs can appear in both href and src attributes).

I tried to use WebKit to do the heavy lifting for me, but for some reason, when I load HTML into a frame, the generated document is all wrong. If I let WebKit fetch the page from the web, the generated document is fine, but WebKit then also downloads all images, stylesheets, and scripts, which I don't want.

This is what I tried to do:

frame->setHtml(HTML);
QWebElement document = frame->documentElement();
QWebElementCollection links = document.findAll("a"); // Doesn't find all links
QWebElementCollection images = document.findAll("img"); // Doesn't find all images
QWebElementCollection scripts = document.findAll("script"); // Doesn't find all scripts
qDebug() << document.toInnerXml(); // Prints a completely messed-up document with several missing elements

What am I doing wrong? Is there an easy way to parse HTML with Qt (or with some other lightweight library)?

Recommended answer

You can always use XPath expressions to make your parsing life easier; take a look at the QXmlQuery documentation (http://doc.trolltech.com/4.5/qxmlquery.html#running-xpath-expressions), for instance. Note that QXmlQuery works on XML, so this only helps when the HTML is well-formed.

Or you could do this:

QWebView* view = new QWebView(parent);
view->load(QUrl("http://www.your_site.com"));
// load() is asynchronous: run the query only after loadFinished(bool) has been emitted
QWebElementCollection elements = view->page()->mainFrame()->findAllElements("a");
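
For the narrow goal stated in the question (collecting the values of href and src attributes), a dependency-free sketch using std::regex may also be enough. This is not an HTML parser and is not taken from the original answer: extractUrls is a hypothetical helper name, it only handles quoted attribute values, and it will also match attribute-like text inside comments or scripts.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of Qt): collects the values of quoted
// href="..." and src="..." attributes in the order they appear.
std::vector<std::string> extractUrls(const std::string& html)
{
    // Group 1: attribute name, group 2: opening quote, group 3: the value.
    // The backreference \2 requires the closing quote to match the opening one.
    static const std::regex attr(
        R"re(\b(href|src)\s*=\s*(["'])(.*?)\2)re",
        std::regex::icase);

    std::vector<std::string> urls;
    for (std::sregex_iterator it(html.begin(), html.end(), attr), end; it != end; ++it)
        urls.push_back((*it)[3].str());
    return urls;
}
```

Calling extractUrls on the raw HTML string returns the attribute values as written, so relative URLs still need to be resolved against the page's base URL. No network requests are made, which avoids the unwanted image, style, and script downloads mentioned in the question.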
