How to parse an HTML file with Qt?

Question

The goal is to obtain a QDomDocument, or something similar, holding the content of an HTML (not XML) document.

The problem is that some tags, especially script, trigger parse errors:

<!DOCTYPE html>
<html>
<head>
    <script type="text/javascript">
        var a = [1,2,3];
        var b = (2<a.length);
    </script>
</head>
<body/>
</html>

Not well formed: Element type "a.length" must be followed by either attribute specifications, ">" or "/>".

I understand that HTML is not the same as XML, but it seems reasonable to expect Qt to offer a solution for this, such as:

  • A way to set the parser to accept HTML
  • Another class for HTML
  • A way to mark the content of certain tag names as CDATA

My current attempt only achieves normal XML parsing:

QString mainHtml;

{
    QFile file("main.html");
    if (!file.open(QIODevice::ReadOnly)) qDebug() << "Error reading file main.html";
    QTextStream stream(&file);
    mainHtml = stream.readAll();
    file.close();
}

QDomDocument doc;
QString errStr;
int errLine=0, errCol=0;
doc.setContent( mainHtml, false, &errStr, &errLine, &errCol);
if (!errStr.isEmpty())
{
    qDebug() << errStr << "L:" << errLine << ":" << errCol;
}

std::function<void(const QDomElement&, int)> printTags=
[&printTags](const QDomElement& elem, int tab)
{
    QString space(3*tab, ' ');
    QDomNode n = elem.firstChild();
    for( ;!n.isNull(); n=n.nextSibling()) 
    {
        QDomElement e = n.toElement();
        if(e.isNull()) continue;
        
        qDebug() << space + e.tagName(); 
        printTags( e, tab+1);
    }
};
printTags(doc.documentElement(), 0);

Note: I would like to avoid including the full WebKit for this.

Recommended answer

I recommend using htmlcxx. It is licensed under the LGPL and works on Linux and Windows. On Windows, compile it with MSYS.

To compile it, extract the files and run:

./configure --prefix=/usr/local/htmlcxx
make
make install

In your .pro file, add the include and library directories:

INCLUDEPATH += /usr/local/htmlcxx/include
LIBS += -L/usr/local/htmlcxx/lib -lhtmlcxx

Usage example

#include <iostream>
#include <string>
#include <strings.h>   // strcasecmp (POSIX; available under MSYS on Windows)
#include "htmlcxx/html/ParserDom.h"

int main (int argc, char *argv[])
{
  using namespace std;
  using namespace htmlcxx;

  //Parse some html code
  string html = "<html><body>hey<A href=\"www.bbxyard.com\">myhome</A></body></html>";
  HTML::ParserDom parser;
  tree<HTML::Node> dom = parser.parseTree(html);
  //Print whole DOM tree
  cout << dom << endl;

  //Dump all links in the tree
  tree<HTML::Node>::iterator it = dom.begin();
  tree<HTML::Node>::iterator end = dom.end();
  for (; it != end; ++it)
  {
     if (strcasecmp(it->tagName().c_str(), "A") == 0)
     {
       it->parseAttributes();
       cout << it->attribute("href").second << endl;
     }
  }

  //Dump all text of the document
  it = dom.begin();
  end = dom.end();
  for (; it != end; ++it)
  {
    if ((!it->isTag()) && (!it->isComment()))
    {
      cout << it->text() << " ";
    }
  }
  cout << endl;
  return 0;
}

Credit for this example: https://github.com/bbxyard/sdk/blob/master/examples/htmlcxx/htmlcxx-demo.cpp

You can't use an XML parser for HTML. Either use htmlcxx or convert the HTML to valid XML first. After that, you are free to use QDomDocument, the Qt XML parsers, etc.
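For the specific failure in the question, the "convert to valid XML" route could be as small as wrapping script bodies in CDATA sections before handing the text to the XML parser. The following is only a minimal sketch under stated assumptions: wrapScriptsInCdata is a hypothetical helper (it is not part of Qt or htmlcxx), it assumes every script start tag has a separate closing tag, that no attribute value contains ">", and that no script body contains "]]>". It does nothing about other non-XML constructs such as unclosed tags or unquoted attributes.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Qt or htmlcxx): wraps the body of every
// <script> element in a CDATA section so that an XML parser treats
// characters such as '<' inside the script as plain text.
// Assumptions: each <script ...> has a matching </script>, attribute values
// contain no '>', and script bodies contain no "]]>".
std::string wrapScriptsInCdata(const std::string& html)
{
    // Lower-cased copy used only for case-insensitive searching; offsets
    // match the original because the length is unchanged.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    std::string out;
    std::string::size_type pos = 0;
    for (;;) {
        const auto open = lower.find("<script", pos);
        if (open == std::string::npos) break;
        const auto tagEnd = lower.find('>', open);            // end of start tag
        if (tagEnd == std::string::npos) break;
        const auto close = lower.find("</script", tagEnd + 1); // start of end tag
        if (close == std::string::npos) break;

        // Copy everything up to and including the start tag unchanged.
        out.append(html, pos, tagEnd + 1 - pos);
        // Wrap the script body in a CDATA section.
        out += "<![CDATA[";
        out.append(html, tagEnd + 1, close - (tagEnd + 1));
        out += "]]>";
        pos = close;
    }
    // Copy whatever follows the last processed script element.
    out.append(html, pos, std::string::npos);
    return out;
}
```

In a Qt program, the result could then be converted with QString::fromStdString and passed to QDomDocument::setContent. This only removes the error caused by script content; arbitrary real-world HTML will still need a proper HTML parser.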

QWebEngine also has parsing functionality, but it adds a large overhead to the application.
