Parse simple HTML with pure C++


Problem description

In my application I need to parse simple HTML code using as few external libraries as possible. My HTML looks like this:

<p> First Content is P </p><h2>Header</h2><p> Text under header </p>
<h2>Header 2</h2><p> Paragraph </p>
<h3>yep</h3><p> no </p>

My HTML contains only the tags p, h2, and h3. I have the following structure:

struct Elements {
    std::string tag;
    std::string content;
};

std::vector<Elements> elems;

My goal is that, after parsing, each Elements entry in the vector contains data like this:

tag = "h2"
content = "Header"

and

tag = "p"
content = "First Content is P"

PS: I need to get the elements in the order in which they appear in the HTML.

Edit:

I just did this in JavaScript and it works fine, but I have basically no idea how to write it in C++:

var a = "<p> First Content is P </p><h2>Header</h2><p> Text under header </p>" +
    "<h2>Header 2</h2><p> Paragraph </p>" +
    "<h3>yep</h3><p> no </p>";

var output = [];

a.replace(/<\b[^>]*>(.*?)<\/(.*?)>/gmi, function(m, key, value) {
    output.push({
        tag: value,
        data: key
    });
})

/*
    output:
        { tag: "p", data: "First Content is P"},
        { tag: "h2", data: "Header" }
        .....
 */

Solution

There are only those three elements, and no missing close tags. Furthermore, it looks as if there are no attributes on the tags, and no elements nested inside other elements. There's no whitespace inside tags either.

Then you are not parsing HTML. You are parsing a special language that is a subset of HTML (well, not even really a subset since your document doesn't validate).

You might have a good reason not to want to use an HTML parser to parse this special language. For example, the code for a full HTML parser is large-ish and perhaps wouldn't otherwise need to be on the very tiny embedded device you're writing for. More likely this is a learning assignment, and the goal is for you to manipulate strings, not to choose the best tool to produce the output you need. I will assume that you must avoid using an HTML library, without further consideration of why.

So, how to parse this special language? How to parse anything. Given all the restrictions I have listed above, you could do it very simply:

  • Look for the first instance in the string of any one of three substrings <p>, <h2>, <h3>. This is your opening tag.
  • Find the first instance of the corresponding close tag.
  • Everything between is the contents of the element. In your example you additionally trim whitespace at each end of the content. Construct an Elements object and add it to your vector (btw consider using a singular class name, not plural).
  • Repeat on the remainder of the string.
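
The four steps above can be sketched in standard C++ roughly as follows. This is a minimal sketch under the restrictions already listed (only lowercase <p>, <h2>, <h3>, no attributes, no nesting, every element closed), not a definitive implementation. The names Element, trim, and parse are chosen here for illustration and do not come from the question; stopping at the first unclosed tag is also an assumption about how malformed input should be handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same shape as the question's struct, with a singular name (illustrative).
struct Element {
    std::string tag;
    std::string content;
};

// Strip leading and trailing whitespace from s.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parse the restricted language: only <p>, <h2>, <h3>, no attributes,
// no nesting, every element closed.
std::vector<Element> parse(const std::string& html) {
    static const char* const tags[] = {"p", "h2", "h3"};
    std::vector<Element> result;
    std::size_t pos = 0;

    for (;;) {
        // Step 1: find the earliest opening tag at or after pos.
        std::size_t best = std::string::npos;
        std::string bestTag;
        for (const char* t : tags) {
            const std::size_t found = html.find("<" + std::string(t) + ">", pos);
            if (found < best) {
                best = found;
                bestTag = t;
            }
        }
        if (best == std::string::npos) break;  // no more elements

        // Step 2: find the first matching close tag after the opening tag.
        const std::size_t start = best + bestTag.size() + 2;  // skip "<tag>"
        const std::string close = "</" + bestTag + ">";
        const std::size_t end = html.find(close, start);
        if (end == std::string::npos) break;  // assumption: stop on malformed input

        // Step 3: everything in between, trimmed, is the content.
        result.push_back({bestTag, trim(html.substr(start, end - start))});

        // Step 4: repeat on the remainder of the string.
        pos = end + close.size();
    }
    return result;
}
```

Called on the sample input from the question, parse returns seven elements in document order, beginning with tag "p" / content "First Content is P" and tag "h2" / content "Header".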

That's it. You could do that using a regular expression, but my general feeling is that, since you said you wanted to do it in C++, you may as well just do it in C++. There's no need to bring another language into it, and whatever the merits and limits of regexes, they certainly are another language.
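
For comparison, here is a rough sketch of the regular-expression route using the standard <regex> header (C++11 and later), which at least avoids an external library. It is only an illustration under the same restrictions as before: the pattern, the Element struct, and the function name parseWithRegex are assumptions made here, not something taken from the question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Illustrative element type; mirrors the struct in the question.
struct Element {
    std::string tag;
    std::string content;
};

std::vector<Element> parseWithRegex(const std::string& html) {
    // Group 1: tag name. Group 2: content without surrounding whitespace.
    // The backreference \1 requires the close tag to match the open tag.
    static const std::regex pattern(R"(<(p|h2|h3)>\s*([^<]*?)\s*</\1>)");

    std::vector<Element> result;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern); it != end; ++it) {
        result.push_back({(*it)[1].str(), (*it)[2].str()});
    }
    return result;
}
```

The iterator visits matches from left to right, so the elements come out in document order. The pattern is short, but it is a second notation to read and maintain alongside the C++.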

However, maybe the extra limits I listed above aren't guaranteed. What if you later want to support spaces inside tags? And attributes? And XML namespaces? And comments? Then you'll wish you'd just used an HTML parser. Therefore what you do for a fixed trivial subset of HTML is different from what you do for a significant subset or one that might become significant in future.
