How can I keep &ldquo; when I use QDomDocument to parse HTML data?

Question

#include <QtXml>

void test()
{
    QDomDocument doc("doc");
    QByteArray data = "<div><p>Of course, &ldquo;Jason.&rdquo; My thoughts, exactly.</p></div>";

    QString sErrorMsg;
    int errLine, errCol;

    if (!doc.setContent(data, &sErrorMsg, &errLine, &errCol)) {
        qDebug() << sErrorMsg;
        qDebug() << errLine << ":" << errCol;
        return;
    }

    QDomNodeList pList = doc.elementsByTagName("p");
    for (int i = 0; i < pList.size(); i++)
    {
        QDomNode p = pList.at(i);
        while (!p.isNull()) {
            QDomElement e = p.toElement();
            if (!e.isNull()) {
                // Here, the left and right quotation marks are gone.
                QByteArray ba = e.text().toUtf8();
            }
            p = p.nextSibling();
        }
    }
}

I'm parsing an HTML phrase with &ldquo; and &rdquo;. When the code reaches QByteArray ba = e.text().toUtf8(); the quotation marks are gone.

How can I keep them?

Recommended answer

I must admit that this is the first time I have used QDomDocument, although I already have some experience with XML in general and libXml2 specifically.

First, I can confirm that QDomElement::text() returns the text without the typographical quotes that were encoded as entities.

I modified OP's MCVE a bit, and now it should be obvious why this happens.

My testQDomDocument.cc:

#include <map> // std::map, used by toString()

#include <QtXml>

static const char* toString(QDomNode::NodeType nodeType);

int main(int, char**)
{
  QByteArray text = "<div><p>Of course, &ldquo;Jason.&rdquo; My thoughts, exactly.</p></div>";
  // setup doc. DOM
  QDomDocument qDomDoc("doc");
  QString qErrorMsg; int errorLine = 0, errorCol = 0;
  if (!qDomDoc.setContent(text, &qErrorMsg, &errorLine, &errorCol)) {
    qDebug() << "Line:" << errorLine << "Col.:" << errorCol << qErrorMsg;
    return 1;
  }
  // inspect DOM
  QDomNodeList qListP = qDomDoc.elementsByTagName("p");
  const int nP = qListP.size();
  qDebug() << "Number of found <p> nodes:" << nP;
  for (int i = 0; i < nP; ++i) {
    const QDomNode qNodeP = qListP.at(i);
    qDebug() << "node <p> #" << i;
    qDebug() << "node.toElement().text(): " << qNodeP.toElement().text();
    for (QDomNode qNode = qNodeP.firstChild(); !qNode.isNull(); qNode = qNode.nextSibling()) {
      qDebug() << toString(qNode.nodeType());
      switch (qNode.nodeType()) {
        case QDomNode::TextNode:
#if 1 // IMHO, the correct way:
          qDebug() << qNode.toText().data();
#else // works as well:
          qDebug() << qNode.nodeValue();
#endif // 1
          break;
        case QDomNode::EntityReferenceNode:
          qDebug() << qNode.nodeName();
          break;
        default:; // rest of types left out to keep sample short
      }
    }
  }
  // done
  return 0;
}

const char* toString(QDomNode::NodeType nodeType)
{
  static const std::map<QDomNode::NodeType, const char*> mapNodeTypes {
    { QDomNode::ElementNode, "QDomNode::ElementNode" },
    { QDomNode::AttributeNode, "QDomNode::AttributeNode" },
    { QDomNode::TextNode, "QDomNode::TextNode" },
    { QDomNode::CDATASectionNode, "QDomNode::CDATASectionNode" },
    { QDomNode::EntityReferenceNode, "QDomNode::EntityReferenceNode" },
    { QDomNode::EntityNode, "QDomNode::EntityNode" },
    { QDomNode::ProcessingInstructionNode, "QDomNode::ProcessingInstructionNode" },
    { QDomNode::CommentNode, "QDomNode::CommentNode" },
    { QDomNode::DocumentNode, "QDomNode::DocumentNode" },
    { QDomNode::DocumentTypeNode, "QDomNode::DocumentTypeNode" },
    { QDomNode::DocumentFragmentNode, "QDomNode::DocumentFragmentNode" },
    { QDomNode::NotationNode, "QDomNode::NotationNode" },
    { QDomNode::BaseNode, "QDomNode::BaseNode" },
    { QDomNode::CharacterDataNode, "QDomNode::CharacterDataNode" }
  };
  const std::map<QDomNode::NodeType, const char*>::const_iterator iter
    = mapNodeTypes.find(nodeType);
  return iter != mapNodeTypes.end() ? iter->second : "<ERROR>";
}

The Qt project file testQDomDocument.pro:

SOURCES = testQDomDocument.cc

QT += xml

Build and test:

$ qmake-qt5 testQDomDocument.pro

$ make && ./testQDomDocument 
g++ -c -fno-keep-inline-dllexport -D_GNU_SOURCE -pipe -O2 -Wall -W -D_REENTRANT -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_XML_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt5 -isystem /usr/include/qt5/QtGui -isystem /usr/include/qt5/QtXml -isystem /usr/include/qt5/QtCore -I. -I/usr/lib/qt5/mkspecs/cygwin-g++ -o testQDomDocument.o testQDomDocument.cc
g++  -o testQDomDocument.exe testQDomDocument.o   -lQt5Gui -lQt5Xml -lQt5Core -lGL -lpthread 
Number of found <p> nodes: 1
node <p> # 0
node.toElement().text():  "Of course, Jason. My thoughts, exactly."
QDomNode::TextNode
"Of course, "
QDomNode::EntityReferenceNode
"ldquo"
QDomNode::TextNode
"Jason."
QDomNode::EntityReferenceNode
"rdquo"
QDomNode::TextNode
" My thoughts, exactly."

$

To understand what happened, it helps to know that the contents of <p> aren't stored directly in the QDomNode instance for <p>. Instead, the QDomNode instance for <p> (like that of any other element) has child nodes that store its contents, e.g. a QDomText instance for a piece of text.

So QDomElement::text() is a convenience function that returns only the (collected) text and apparently ignores any other nodes. In OP's sample, not all child nodes of the QDomElement for <p> are text nodes.

The entities (&ldquo;, &rdquo;) are stored as QDomEntityReference instances (https://doc.qt.io/qt-5/qdomentityreference.html) and are evidently skipped by QDomElement::text().
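
The difference between collecting only text nodes and also keeping entity references can be illustrated with a toy model. This is plain C++ and not Qt code: Node, NodeKind, textOnly(), textWithEntities() and sampleChildren() are hypothetical names invented for this sketch, and textOnly() merely mimics the behaviour observed in the output above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the QDomText and QDomEntityReference children.
enum class NodeKind { Text, EntityReference };

struct Node {
  NodeKind kind;
  std::string value; // text data, or the entity name for a reference
};

// Mimics the observed behaviour of QDomElement::text():
// concatenates text nodes and silently skips everything else.
std::string textOnly(const std::vector<Node> &children)
{
  std::string result;
  for (const Node &node : children) {
    if (node.kind == NodeKind::Text) result += node.value;
  }
  return result;
}

// Alternative: keeps entity references by re-emitting them as "&name;".
std::string textWithEntities(const std::vector<Node> &children)
{
  std::string result;
  for (const Node &node : children) {
    switch (node.kind) {
      case NodeKind::Text:
        result += node.value;
        break;
      case NodeKind::EntityReference:
        assert(!node.value.empty()); // a reference always has a name
        result += "&" + node.value + ";";
        break;
    }
  }
  return result;
}

// The children of <p> as reported by the sample output.
std::vector<Node> sampleChildren()
{
  return {
    { NodeKind::Text, "Of course, " },
    { NodeKind::EntityReference, "ldquo" },
    { NodeKind::Text, "Jason." },
    { NodeKind::EntityReference, "rdquo" },
    { NodeKind::Text, " My thoughts, exactly." }
  };
}
```

With the real DOM, the same idea would presumably be a walk over firstChild() and nextSibling(), as in the sample, appending toText().data() for text nodes and "&" + nodeName() + ";" for entity reference nodes.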

I must admit I was a bit surprised because (from my experience with libXml2) I'm used to entities being resolved into text as well.

The following paragraph in the QDomEntityReference documentation:

Moreover, the XML processor may completely expand references to entities while building the DOM tree, instead of providing QDomEntityReference objects.

supported my expectation that QDomDocument would do the same.

However, the sample shows that this isn't the case here.

Thinking twice, I realized that &ldquo; and &rdquo; are not predefined entities in XML.

They are predefined in HTML5 (and earlier HTML versions) but not in XML in general.

The only predefined entities in XML are:

Name | Chr. | Codepoint   | Meaning
-----+------+-------------+-----------------
quot |  "   | U+0022 (34) | quotation mark
amp  |  &   | U+0026 (38) | ampersand
apos |  '   | U+0027 (39) | apostrophe
lt   |  <   | U+003C (60) | less-than sign
gt   |  >   | U+003E (62) | greater-than sign

So, to get HTML entities replaced, something else is needed with QDomDocument.

Btw., while looking for a hint in this direction, I stumbled upon:

SO: QDomDocument fails to set content of an HTML document with a tag

I thought for a while about how this could be fixed.

I wonder why I didn't immediately think of a very simple fix: replacing the entities with numeric character references.

HTML Entity | NCR
------------+----------
&ldquo;     | &#x201c;
&rdquo;     | &#x201d;

A slight modification of the sample above:

int main(int, char**)
{
  QByteArray text =
    "<div><p>Of course, &#x201c;Jason.&#x201d; My thoughts, exactly.</p></div>";
  // setup doc. DOM
  QDomDocument qDomDoc("doc");
  QString qErrorMsg; int errorLine = 0, errorCol = 0;
  if (!qDomDoc.setContent(text, &qErrorMsg, &errorLine, &errorCol)) {
    qDebug() << "Line:" << errorLine << "Col.:" << errorCol << qErrorMsg;
    return 1;
  }
  // inspect DOM
  QDomNodeList qListP = qDomDoc.elementsByTagName("p");
  const int nP = qListP.size();
  qDebug() << "Number of found <p> nodes:" << nP;
  for (int i = 0; i < nP; ++i) {
    const QDomNode qNodeP = qListP.at(i);
    qDebug() << "node <p> #" << i;
    qDebug() << "node.toElement().text(): " << qNodeP.toElement().text().toUtf8();
    for (QDomNode qNode = qNodeP.firstChild(); !qNode.isNull(); qNode = qNode.nextSibling()) {
      qDebug() << toString(qNode.nodeType());
      switch (qNode.nodeType()) {
        case QDomNode::TextNode:
          qDebug() << qNode.toText().data().toUtf8();
          break;
        case QDomNode::EntityReferenceNode:
          qDebug() << qNode.nodeName();
          break;
        default:; // rest of types left out to keep sample short
      }
    }
  }
  // done
  return 0;
}

I got the following output:

$ make && ./testQDomDocument
g++ -c -fno-keep-inline-dllexport -D_GNU_SOURCE -pipe -O2 -Wall -W -D_REENTRANT -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_XML_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt5 -isystem /usr/include/qt5/QtGui -isystem /usr/include/qt5/QtXml -isystem /usr/include/qt5/QtCore -I. -I/usr/lib/qt5/mkspecs/cygwin-g++ -o testQDomDocument.o testQDomDocument.cc
g++  -o testQDomDocument.exe testQDomDocument.o   -lQt5Gui -lQt5Xml -lQt5Core -lGL -lpthread 
Number of found <p> nodes: 1
node <p> # 0
node.toElement().text():  "Of course, \xE2\x80\x9CJason.\xE2\x80\x9D My thoughts, exactly."
QDomNode::TextNode
"Of course, \xE2\x80\x9CJason.\xE2\x80\x9D My thoughts, exactly."

$

Et voilà! Now there is only one child node in <p>, holding the complete text including the quotes that were encoded as NCRs.
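
If the HTML input can't be edited by hand, the replacement could be done in a preprocessing step before the data is handed to setContent(). The following is a minimal sketch in plain C++ under stated assumptions: replaceHtmlEntities() is a hypothetical helper, not part of Qt, and its table only covers the two entities discussed here. A real table would have to list every HTML entity that may occur in the input.

```cpp
#include <cassert>
#include <map>
#include <string>

// Replaces known HTML named entities with numeric character references.
// The table is deliberately tiny; unknown "&...;" sequences are left untouched.
std::string replaceHtmlEntities(const std::string &html)
{
  static const std::map<std::string, std::string> entityToNcr {
    { "ldquo", "&#x201c;" },
    { "rdquo", "&#x201d;" }
  };
  std::string result;
  std::string::size_type pos = 0;
  while (pos < html.size()) {
    const std::string::size_type amp = html.find('&', pos);
    if (amp == std::string::npos) {
      // no further entity: copy the rest and stop
      result.append(html, pos, std::string::npos);
      break;
    }
    assert(amp >= pos); // find() never returns a position before its start
    result.append(html, pos, amp - pos); // copy text in front of '&'
    const std::string::size_type semi = html.find(';', amp);
    if (semi != std::string::npos) {
      const auto iter = entityToNcr.find(html.substr(amp + 1, semi - amp - 1));
      if (iter != entityToNcr.end()) {
        result += iter->second; // known entity: emit its NCR
        pos = semi + 1;
        continue;
      }
    }
    result += '&'; // not a known entity: keep the '&' as it is
    pos = amp + 1;
  }
  return result;
}
```

Applied to the original input with &ldquo; and &rdquo;, this yields the same string that was typed by hand in the modified sample. The predefined XML entities such as &amp; pass through unchanged because they are not in the table.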

Though, the output of the quotes as \xE2\x80\x9C and \xE2\x80\x9D made me a bit uncertain. (Please note that I added .toUtf8() to the debug output because I got ? and ? before.)

A short check in a UTF-8 encoding table and the Unicode characters convinced me that these UTF-8 byte sequences are correct.
But why the escaping?
Wrong LANG setting in my bash?

$ ./testQDomDocument 2>&1 | hexdump -C
00000000  4e 75 6d 62 65 72 20 6f  66 20 66 6f 75 6e 64 20  |Number of found |
00000010  3c 70 3e 20 6e 6f 64 65  73 3a 20 31 0a 6e 6f 64  |<p> nodes: 1.nod|
00000020  65 20 3c 70 3e 20 23 20  30 0a 6e 6f 64 65 2e 74  |e <p> # 0.node.t|
00000030  6f 45 6c 65 6d 65 6e 74  28 29 2e 74 65 78 74 28  |oElement().text(|
00000040  29 3a 20 20 22 4f 66 20  63 6f 75 72 73 65 2c 20  |):  "Of course, |
00000050  5c 78 45 32 5c 78 38 30  5c 78 39 43 4a 61 73 6f  |\xE2\x80\x9CJaso|
00000060  6e 2e 5c 78 45 32 5c 78  38 30 5c 78 39 44 20 4d  |n.\xE2\x80\x9D M|
00000070  79 20 74 68 6f 75 67 68  74 73 2c 20 65 78 61 63  |y thoughts, exac|
00000080  74 6c 79 2e 22 0a 51 44  6f 6d 4e 6f 64 65 3a 3a  |tly.".QDomNode::|
00000090  54 65 78 74 4e 6f 64 65  0a 22 4f 66 20 63 6f 75  |TextNode."Of cou|
000000a0  72 73 65 2c 20 5c 78 45  32 5c 78 38 30 5c 78 39  |rse, \xE2\x80\x9|
000000b0  43 4a 61 73 6f 6e 2e 5c  78 45 32 5c 78 38 30 5c  |CJason.\xE2\x80\|
000000c0  78 39 44 20 4d 79 20 74  68 6f 75 67 68 74 73 2c  |x9D My thoughts,|
000000d0  20 65 78 61 63 74 6c 79  2e 22 0a                 | exactly.".|
000000db

$

Aha. That rather seems to be caused by qDebug(), which escapes all bytes with values of 128 and above.
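
The check of the byte sequences against the encoding table can also be reproduced by encoding the two code points by hand. The sketch below is plain C++ following the standard UTF-8 bit layout. encodeUtf8() is a hypothetical helper written for this illustration and not a Qt function, and it does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes a single Unicode code point as a UTF-8 byte sequence.
std::string encodeUtf8(std::uint32_t codepoint)
{
  assert(codepoint <= 0x10FFFF); // beyond the Unicode range
  std::string bytes;
  if (codepoint < 0x80) {
    // 1 byte: 0xxxxxxx
    bytes += static_cast<char>(codepoint);
  } else if (codepoint < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    bytes += static_cast<char>(0xC0 | (codepoint >> 6));
    bytes += static_cast<char>(0x80 | (codepoint & 0x3F));
  } else if (codepoint < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    bytes += static_cast<char>(0xE0 | (codepoint >> 12));
    bytes += static_cast<char>(0x80 | ((codepoint >> 6) & 0x3F));
    bytes += static_cast<char>(0x80 | (codepoint & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    bytes += static_cast<char>(0xF0 | (codepoint >> 18));
    bytes += static_cast<char>(0x80 | ((codepoint >> 12) & 0x3F));
    bytes += static_cast<char>(0x80 | ((codepoint >> 6) & 0x3F));
    bytes += static_cast<char>(0x80 | (codepoint & 0x3F));
  }
  return bytes;
}
```

For U+201C, the top four bits give 0xE2, the middle six bits are all zero and give 0x80, and the low six bits (0x1C) give 0x9C. That matches the \xE2\x80\x9C in the output, and U+201D differs only in the last byte.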
