Minifying final HTML output using regular expressions with CodeIgniter


Question

Google's pages suggest that you minify HTML, that is, remove all unnecessary whitespace. CodeIgniter can gzip its output, or that can be done via .htaccess, but I would also like to remove unnecessary whitespace from the final HTML output.

I played a bit with this piece of code, and it seems to work. It does produce HTML without excess spaces, and it also strips the tab formatting.

class Welcome extends CI_Controller 
{
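    // CodeIgniter passes the finalized output to _output() as its argument.
    function _output($output)
    {
        // Collapse every run of whitespace into a single space.
        echo preg_replace('!\s+!', ' ', $output);
    }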

    function index(){
    ...
    }
}

The problem is that there may be tags such as <pre> and <textarea> whose contents contain whitespace that matters, and the regular expression above would collapse it too. So how do I remove excess whitespace from the final HTML with a regular expression, without affecting the whitespace or formatting inside these particular tags?
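To make the problem concrete, here is a minimal sketch that runs the naive pattern from the question over a small page. The sample markup in `$html` is made up for illustration.

```php
<?php
// Hypothetical sample page: the <pre> block depends on its line break and indentation.
$html = "<div>\n    <p>Hello   world</p>\n    <pre>line 1\n    line 2</pre>\n</div>";

// The naive approach from the question: collapse every whitespace run to one space.
$minified = preg_replace('!\s+!', ' ', $html);

echo $minified, "\n";
// prints: <div> <p>Hello world</p> <pre>line 1 line 2</pre> </div>
```

The two lines inside <pre> end up on one line, which changes how the browser renders them.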

Thanks to @Alan Moore I got the answer; this worked for me:

echo preg_replace('#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#', ' ', $output);

ridgerunner did a very good job of analyzing this regular expression, and I ended up using his solution. Cheers to ridgerunner.

Recommended answer

For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so it can be read by mere mortals:

function process_data_alan($text)
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}
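As a quick check of that behaviour, the sketch below applies the one-line form of the same pattern to a small sample. The `$sample` markup is an assumption made up for illustration; the pattern is the one quoted in the question.

```php
<?php
// One-line form of Alan Moore's pattern, as quoted in the question.
$re = '#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#';

// Hypothetical sample: whitespace runs both outside and inside a <pre> element.
$sample = "<p>Hello   world</p>\n<pre>a\n   b</pre>\n<p>done</p>";

$out = preg_replace($re, ' ', $sample);
echo $out, "\n";
// The runs outside <pre> collapse to single spaces;
// the newline and indentation inside <pre> are left untouched.
```

Whitespace inside <pre> is skipped because the lookahead, scanning forward from there, reaches the closing </pre> tag before it can find a blacklisted start tag or the end of the string.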

I'm new around here, but I can see right off that Alan is quite good at regex. I would only add the following suggestions.


  1. There is an unnecessary group (marked as such in the comments above) which can be removed.
  2. Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
  3. Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
  4. There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
  5. On a more serious note, this same alternation group, i.e. (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings. That can overflow the stack, causing the Apache/PHP executable to seg-fault silently and crash with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible because it has only a 256KB stack, whereas *nix executables are typically built with a stack of 8MB or more.) Philip Hazel, the author of the PCRE regex engine used in PHP, discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in that document (applying a possessive plus to the first alternative), there will still be a lot of recursion if the HTML file is large and has many non-blacklisted tags. For example, on my Win32 box (with an executable having a 256KB stack), the script blows up with a test file of only 60KB. Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit far too high, at 100000. (According to the PCRE docs, it should be set to a value equal to the stack size divided by 500.)
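The stack-size rule in item 5 works out as follows. This is only a sketch: the helper name `recursion_limit_for_stack` is hypothetical, and the two stack sizes are the ones mentioned in the text.

```php
<?php
// Hypothetical helper: apply the "stack size divided by 500" rule from item 5.
function recursion_limit_for_stack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

$win32 = recursion_limit_for_stack(256 * 1024);      // 262144 / 500, rounded down
$nix   = recursion_limit_for_stack(8 * 1024 * 1024); // 8388608 / 500, rounded down

echo $win32, "\n"; // prints 524
echo $nix, "\n";   // prints 16777
```

Both values are far below PHP's default of 100000, which is why the default does not stop the engine before the stack runs out.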

Here is an improved version which is faster than the original, handles larger input, and fails gracefully with a message if the input string is too large to handle:

// Set PCRE recursion limit to sane value = STACKSIZE / 500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777");  // 8MB stack. *nix
function process_data_jmr1($text)
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}
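Calling exit() stops the whole request. If you would rather serve the page unminified when PCRE gives up, one possible variation is sketched below. The wrapper `minify_or_fallback` and the simple demo pattern are assumptions for illustration, not part of the solution above; in practice you would pass in the full pattern.

```php
<?php
// Hypothetical wrapper: on a PCRE failure, return the original HTML instead of exiting.
function minify_or_fallback(string $html, string $re): string
{
    $result = preg_replace($re, ' ', $html);
    if ($result === null) {
        // preg_last_error() returns a PREG_*_ERROR constant,
        // for example PREG_RECURSION_LIMIT_ERROR.
        error_log('HTML minify skipped, PCRE error code ' . preg_last_error());
        return $html;
    }
    return $result;
}

// Demo with a deliberately simple pattern (two or more whitespace characters).
$page = "<p>a  b</p>";
$out = minify_or_fallback($page, '%\s{2,}%');
echo $out, "\n"; // prints <p>a b</p>
```

The trade-off is that a failure is now only visible in the error log, so it is worth checking the log after deploying.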

p.s. I am intimately familiar with this PHP/Apache seg-fault problem, as I was involved in helping the Drupal community while they were wrestling with it. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also ran into it with the BBCode parser on the FluxBB forum software project.

Hope this helps.
