Minify HTML with Boost regex in C++

Question

How can I minify HTML using C++?

An external library could be the answer, but I'm mainly looking for improvements to my current code, although I'm all ears for other possibilities.

This is my interpretation in C++ of the following answer: http://stackoverflow.com/a/5324014/570796

The only parts I changed from the original post are the flag group at the top, (?ix), and some escape characters.

#include <string>
#include <boost/regex.hpp>

void minifyhtml(std::string* s) {
  boost::regex nowhitespace(
    "(?ix)"
    "(?>"           // Match all whitespans other than single space.
    "[^\\S ]\\s*"   // Either one [\t\r\n\f\v] and zero or more ws,
    "| \\s{2,}"     // or two or more consecutive-any-whitespace.
    ")"             // Note: The remaining regex consumes no text at all...
    "(?="           // Ensure we are not in a blacklist tag.
    "[^<]*+"        // Either zero or more non-"<" {normal*}
    "(?:"           // Begin {(special normal*)*} construct
    "<"             // or a < starting a non-blacklist tag.
    "(?!/?(?:textarea|pre|script)\\b)"
    "[^<]*+"        // more non-"<" {normal*}
    ")*+"           // Finish "unrolling-the-loop"
    "(?:"           // Begin alternation group.
    "<"             // Either a blacklist start tag.
    "(?>textarea|pre|script)\\b"
    "| \\z"         // or end of file.
    ")"             // End alternation group.
    ")"             // If we made it here, we are not in a blacklist tag.
  );

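  // @todo Don't remove conditional html comments
  // Non-greedy, so each comment is matched on its own rather than
  // everything from the first "<!--" to the last "-->".
  boost::regex nocomments("<!--.*?-->");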

  *s = boost::regex_replace(*s, nowhitespace, " ");
  *s = boost::regex_replace(*s, nocomments, "");
}

Only the first regex is from the original post; the other one is something I'm working on and should be considered far from complete. Hopefully it gives a good idea of what I'm trying to accomplish, though.
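As an illustration of the comment-stripping step and its @todo, here is a minimal sketch that removes comments while leaving those that begin with "<!--[if" in place. It is not from the original post: the function name strip_comments is hypothetical, and it uses std::regex instead of Boost so that it compiles without extra dependencies. It only recognizes the simple conditional-comment form and does not try to cover every variant.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the original post): removes HTML comments
// but leaves anything that starts with "<!--[if" untouched.
std::string strip_comments(const std::string& html) {
    // [\s\S] matches any character including newlines, because "." does not
    // match line terminators in std::regex's default ECMAScript grammar.
    // "*?" is non-greedy, so each comment is matched on its own.
    // (?!\[if) is a negative lookahead that skips conditional comments.
    static const std::regex comment(R"(<!--(?!\[if)[\s\S]*?-->)");
    return std::regex_replace(html, comment, "");
}
```

Calling strip_comments on the string before or after the whitespace pass should work either way, since the two steps do not depend on each other.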

Recommended answer

Regular expressions are a powerful tool, but I think using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare: you can't look at it and quickly understand what it is supposed to match.

You need an HTML parser that tokenizes the input file, or that lets you access the tokens either as a stream or as an object tree. Basically, read the tokens, discard the tokens and attributes you don't need, then write what remains to the output. An approach like this would let you develop a solution faster than trying to tackle the problem with regexps.
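The read, discard, write loop described above can be sketched with a tiny hand-rolled tokenizer. This is only an illustration under simplifying assumptions, not a real HTML parser: the name minify_html_tokens is hypothetical, a tag is assumed to end at the first ">" (so a ">" inside an attribute value is not handled), and only pre, textarea, and script are treated as elements whose content must be kept verbatim.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Collapse every run of whitespace in `text` to a single space.
static std::string collapse_ws(const std::string& text) {
    std::string out;
    bool in_ws = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            if (!in_ws) out += ' ';
            in_ws = true;
        } else {
            out += static_cast<char>(c);
            in_ws = false;
        }
    }
    return out;
}

// Hypothetical token-by-token minifier (a sketch, not a full HTML parser).
std::string minify_html_tokens(const std::string& html) {
    // Lower-cased copy with identical offsets, for case-insensitive matching.
    std::string lower = html;
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::size_t npos = std::string::npos;
    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] != '<') {
            // Text token: everything up to the next '<', whitespace collapsed.
            std::size_t end = html.find('<', pos);
            if (end == npos) end = html.size();
            out += collapse_ws(html.substr(pos, end - pos));
            pos = end;
        } else if (html.compare(pos, 4, "<!--") == 0) {
            // Comment token: dropped unless it starts like a conditional comment.
            std::size_t end = html.find("-->", pos + 4);
            end = (end == npos) ? html.size() : end + 3;
            if (lower.compare(pos, 7, "<!--[if") == 0) out.append(html, pos, end - pos);
            pos = end;
        } else {
            // Tag token: copied unchanged up to and including '>'.
            std::size_t end = html.find('>', pos);
            end = (end == npos) ? html.size() : end + 1;
            out.append(html, pos, end - pos);
            // Element name of an opening tag (empty for closing tags, doctype, etc.).
            std::size_t n = pos + 1;
            while (n < end && (std::isalnum(static_cast<unsigned char>(lower[n])) || lower[n] == '-')) ++n;
            const std::string name = lower.substr(pos + 1, n - pos - 1);
            if (name == "pre" || name == "textarea" || name == "script") {
                // Copy the element's content verbatim up to its closing tag.
                std::size_t close = lower.find("</" + name, end);
                if (close == npos) close = html.size();
                out.append(html, end, close - end);
                end = close;
            }
            pos = end;
        }
    }
    return out;
}
```

For example, minify_html_tokens("<p>a  \n  b</p>") collapses the text between the tags to "a b", while the content of a pre element is copied through unchanged. A real parser library would replace the tokenizing branches; the filtering logic would stay much the same.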

I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.

In C++, libxml (which might have an HTML support module), Qt 4, or tinyxml could work; libstrophe also uses some kind of XML parser that could work.

Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, it has the "Beautiful Soup" module, which would work very well for this kind of problem.

Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).
