PHP preg_split on spaces, but not within tags

Question

I am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line); and running it on phpliveregex.com. It produces this array:

array(10
  0=><b>test</b>
  1=>or
  2=><em>oh
  3=>yeah</em>
  4=>and
  5=><i>
  6=>oh
  7=>yeah
  8=></i>
  9=>"ye we 'hold' it"
)

That is not what I want. It should be split on spaces only outside HTML tags, like this:

array(6
  0=><b>test</b>
  1=>or
  2=><em>oh yeah</em>
  3=>and
  4=><i>oh yeah</i>
  5=>"ye we 'hold' it"
)

With this regex I can only add an exception for "double quotes", but I really need help adding more, such as the tags <img/><a></a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>.

Any explanation of how that regex works would also be appreciated.

Solution

It's easier to use DOMDocument, since you don't need to describe what an HTML tag is or what it looks like. You only need to check the nodeType. When the node is a text node, split it with preg_match_all (that is handier than designing a pattern for preg_split):

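$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';

// Wrap the fragment in a <div> so it has a single root element;
// LIBXML_HTML_NOIMPLIED stops libxml from adding <html> and <body>.
$dom = new DOMDocument;
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

// Direct children of the wrapper: text nodes and top-level elements.
$nodeList = $dom->documentElement->childNodes;

$results = [];

foreach ($nodeList as $childNode) {
    if ($childNode->nodeType === XML_TEXT_NODE) {
        // Text node: take runs of non-space, non-quote characters, or a
        // double-quoted part (the closing quote is optional).
        // Whitespace-only text nodes produce no match and are skipped.
        if (preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m)) {
            $results = array_merge($results, $m[0]);
        }
    } else {
        // Any other node (an element such as <b>...</b>) is kept whole.
        $results[] = $dom->saveHTML($childNode);
    }
}

print_r($results);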

Note: I have chosen a default behaviour for a double-quoted part that stays unclosed (no closing quote): it is kept as a single item that runs to the end of the text node. Feel free to change it.
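As an illustration of changing that default, here is a minimal sketch that compares the pattern above with one possible stricter variant. The tokenize() helper and the second pattern are assumptions made for this example, not part of the original answer; the variant only groups words when the quote is actually closed and otherwise leaves a stray quote attached to the word it precedes.

```php
<?php
// Hypothetical helper for this example: return every token the pattern finds.
function tokenize(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $m);
    return $m[0];
}

$text = 'a "b c" "d e';

// Pattern from the answer: the closing quote is optional, so an
// unclosed quote takes everything up to the end of the text.
$lenient = tokenize('~[^\s"]+|"[^"]*"?~', $text);

// Assumed variant: a quoted part must be closed to count as one token;
// otherwise fall back to plain whitespace-separated words.
$strict = tokenize('~"[^"]*"|\S+~', $text);

print_r($lenient); // a, "b c", "d e
print_r($strict);  // a, "b c", "d, e
```

Which behaviour is preferable depends on how the unclosed input should be treated; the variant also changes how a quote in the middle of a word is handled, so it is worth checking against real data.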

Note 2: Sometimes the LIBXML_ constants are not defined. You can work around this by testing for the constant first and defining it when needed:

if (!defined('LIBXML_HTML_NOIMPLIED'))
    define('LIBXML_HTML_NOIMPLIED', 8192);
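On the question of how the original regex works: each branch that ends in (*SKIP)(*F) consumes something that must be protected and then forces a failure, and (*SKIP) makes the regex engine resume after the consumed text instead of retrying inside it. Only the last branch, \x20, can succeed, so only spaces outside the protected parts become split points. The sketch below extends the same idea to tags. The two tag branches are an assumption added for illustration, not part of the original answer, and they will not cope with arbitrary HTML (for example nested tags of the same name), which is why DOMDocument remains the more robust route.

```php
<?php
$branches = [
    '"[^"]*"(*SKIP)(*F)',                // a double-quoted part: consume it, then fail
    '<(\w+)\b[^>]*>.*?</\1>(*SKIP)(*F)', // an element up to its closing tag: same trick
    '<[^>]*>(*SKIP)(*F)',                // a lone tag such as <img/>: same trick
    '\x20',                              // a space left unprotected: split here
];
$pattern = '~' . implode('|', $branches) . '~s';

$input = '<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"';

$parts = preg_split($pattern, $input);

print_r($parts);
```

With this input the split yields the six items the question asks for: the three elements kept whole, the words "or" and "and", and the quoted part.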
