Parse simple HTML with pure C++

Question

In my application I need to parse simple HTML code using as few external libraries as possible. My HTML looks like this:

<p> First Content is P </p><h2>Header</h2><p> Text under header </p>
<h2>Header 2</h2><p> Paragraph </p>
<h3>yep</h3><p> no </p>

My HTML contains only the tags p, h2 and h3. I have the following structure:

struct Elements {
    std::string tag;
    std::string content;
};

std::vector<Elements> elems;

My goal is that, after parsing, each Elements in the vector contains data like this:

tag = "h2"
content = "Header"

and

tag = "p"
content = "First Content is P"

PS: I need to get the elements in the order they appear in the HTML.

Edit:

I just did this in javascript and it's working fine, but I have basically no idea how to write it down in c++:

var a = "<p> First Content is P </p><h2>Header</h2><p> Text under header </p>" +
    "<h2>Header 2</h2><p> Paragraph </p>" +
    "<h3>yep</h3><p> no </p>";

var output = [];

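// key is the captured content, value is the captured closing tag name
a.replace(/<\b[^>]*>(.*?)<\/(.*?)>/gmi, function(m, key, value) {
    output.push({
        tag: value,
        data: key.trim() // trim so the data matches the output shown below
    });
});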

/*
    output:
        { tag: "p", data: "First Content is P"},
        { tag: "h2", data: "Header" }
        .....
 */

Solution

There are only those three elements, and no missing close tags. It looks as if, furthermore, there are no attributes on the tags, and there aren't even any elements inside elements. There's no whitespace inside tags either.

Then you are not parsing HTML. You are parsing a special language that is a subset of HTML (well, not even really a subset since your document doesn't validate).

You might have a good reason not to want to use an HTML parser to parse this special language. For example, the code for a full HTML parser is large-ish and perhaps wouldn't otherwise need to be on the very tiny embedded device you're writing for. More likely this is a learning assignment, and the goal is for you to manipulate strings, not to choose the best tool to produce the output you need. I will assume that you must avoid using an HTML library, without further consideration of why.

So, how to parse this special language? How to parse anything. Given all the restrictions I have listed above, you could do it very simply:

  • Look for the first instance in the string of any one of three substrings <p>, <h2>, <h3>. This is your opening tag.
  • Find the first instance of the corresponding close tag.
  • Everything between is the contents of the element. In your example you additionally trim whitespace at each end of the content. Construct an Elements object and add it to your vector (btw consider using a singular class name, not plural).
  • Repeat on the remainder of the string.
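The steps above could be sketched roughly as follows. This is only a minimal illustration under the restrictions listed earlier (three known tags, no attributes, no nesting, no whitespace inside tags). The names `Element`, `trim` and `parse_simple_html` are made up for this example and are not from the question; on a missing close tag the sketch simply stops, which is an assumption about the desired behaviour.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Singular name, as suggested above (the question calls it Elements).
struct Element {
    std::string tag;
    std::string content;
};

// Hypothetical helper: strip leading and trailing whitespace.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical function: parses the restricted p/h2/h3 language only.
std::vector<Element> parse_simple_html(const std::string& html) {
    static const std::string tags[] = {"p", "h2", "h3"};
    std::vector<Element> result;
    std::size_t pos = 0;

    while (true) {
        // Step 1: find the earliest opening tag at or after pos.
        std::size_t open_at = std::string::npos;
        const std::string* found = nullptr;
        for (const std::string& t : tags) {
            const std::size_t at = html.find("<" + t + ">", pos);
            if (at < open_at) {  // npos is the largest value, so any hit wins
                open_at = at;
                found = &t;
            }
        }
        if (found == nullptr) break;  // no more opening tags

        // Step 2: find the first matching close tag after the opening tag.
        const std::size_t content_start = open_at + found->size() + 2;  // skip "<tag>"
        const std::size_t close_at = html.find("</" + *found + ">", content_start);
        if (close_at == std::string::npos) break;  // assumption: stop on malformed input

        // Step 3: everything in between is the content, trimmed.
        result.push_back({*found, trim(html.substr(content_start, close_at - content_start))});

        // Step 4: continue with the remainder of the string.
        pos = close_at + found->size() + 3;  // skip "</tag>"
    }
    return result;
}
```

Because the search always picks the earliest opening tag, the elements come out in document order, which is what the question asks for.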

That's it. You could do that using a regular expression, but my general feeling is that since you said you wanted to do it in C++ then you may as well just do it in C++. No need to bring another language into it, and whatever the merits and limits of regexes, they certainly are another language.
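For comparison, if you did want the regular-expression route, the standard library's `<regex>` header (C++11) can express roughly the same thing. The following is a sketch only, with the same restrictions as before; `Element` and `parse_with_regex` are illustrative names, and the pattern is an adaptation, not a direct copy of the JavaScript one in the question.

```cpp
#include <regex>
#include <string>
#include <vector>

struct Element {
    std::string tag;
    std::string content;
};

// Hypothetical function: same restricted language, matched with std::regex.
std::vector<Element> parse_with_regex(const std::string& html) {
    // Group 1 is the tag name, group 2 the content without surrounding
    // whitespace; \1 requires the close tag to match the opening tag.
    // Note: '.' does not match newlines in the default ECMAScript grammar,
    // so this assumes each element's content sits on one line.
    static const std::regex element_re(R"(<(p|h2|h3)>\s*(.*?)\s*</\1>)");

    std::vector<Element> result;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), element_re); it != end; ++it) {
        result.push_back({(*it)[1].str(), (*it)[2].str()});
    }
    return result;
}
```

Matches are visited left to right, so document order is preserved here as well.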

However, maybe the extra limits I listed above aren't guaranteed. What if you later want to support spaces inside tags? And attributes? And XML namespaces? And comments? Then you'll wish you'd just used an HTML parser. Therefore what you do for a fixed trivial subset of HTML is different from what you do for a significant subset or one that might become significant in future.
