Split a string into smaller parts with constraints [PHP RegEx HTML]


Question

I need to split a long string into an array, subject to the following constraints:

  • The input will be an HTML string, either a full page or a fragment.
  • Each part (new string) has a limited number of characters (e.g. no more than 8000).
  • Each part may contain several sentences (delimited by ".", a full stop) but never a partial sentence. The exception is the last part of the string, which may not end in a full stop.
  • The string contains HTML tags, and a tag must not be split (for example, <a href='test.html'> into <a href='test. and html'>); every tag must stay intact. An opening tag and its closing tag may, however, end up in different segments/chunks.
  • If any sentence in the middle is longer than the desired length, its leading and trailing tags and whitespace should go into separate elements of the array. If the sentence is still too long after that, divide it into multiple elements of the array :(
  • Please note: there is no need to parse the HTML, only to recognize tags (like or etc), i.e. <.*>

I think a regular expression with preg_split can do it. Would you please help me with the proper RegEx? Any solution other than regex is also welcome.

Thanks

萨迪

Recommended answer

Correct me if I'm wrong, but I don't think you can do this with a simple regexp. In a regexp implementation that supports variable-length lookbehind, you could use something like this:

$parts = preg_split("/(?<!<[^>]*)\./", $input);

But PHP does not allow non-fixed-length lookbehind, so that won't work. Apparently the only two regexp flavors that do allow it are JGsoft and .NET.
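One workaround that stays within PCRE, offered here as a sketch and not as part of the original answer, is to match whole tags first and throw those matches away with the backtracking control verbs (*SKIP)(*F). Only full stops outside <...> then act as delimiters. The helper name splitOutsideTags is made up for illustration.

```php
<?php
// Hypothetical helper: split on "." only when it is not inside <...>.
function splitOutsideTags(string $input): array {
    // "<[^>]*>(*SKIP)(*F)" consumes a whole tag and then forces a failure,
    // so the engine resumes searching after the tag; "\." is the real delimiter.
    return preg_split('/<[^>]*>(*SKIP)(*F)|\./', $input);
}

$parts = splitOutsideTags('one. two <a href="a.b.c">link</a>. three');
print_r($parts);
```

As with explode, the full stops themselves are not kept in the resulting parts, so they would have to be re-added when parts are merged back up to the length limit.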

My method of dealing with this would be:

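function splitStringUp($input, $maxlen) {
    // Split on every full stop; the separators are re-inserted when parts are merged.
    $parts = explode(".", $input);
    $i = 0;
    while ($i < count($parts)) {
        // Guard: without a next part there is nothing to merge with. This also
        // prevents an endless loop when the input ends inside an unclosed tag.
        $hasNext = $i < count($parts) - 1;
        if ($hasNext && preg_match("/<[^>]*$/", $parts[$i])) {
            // The part ends inside an unclosed tag: rejoin it with the next part,
            // regardless of length, so the tag stays intact.
            array_splice($parts, $i, 2, $parts[$i] . "." . $parts[$i+1]);
        } elseif ($hasNext && strlen($parts[$i] . "." . $parts[$i+1]) < $maxlen) {
            // The next sentence still fits: merge it into the current part.
            array_splice($parts, $i, 2, $parts[$i] . "." . $parts[$i+1]);
        } else {
            $i++;
        }
    }
    return $parts;
}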

You didn't mention what should happen when an individual sentence is longer than 8000 characters, so this simply leaves such sentences intact.
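If over-long sentences do need to be broken up, one possible approach is to tokenize the part into whole tags and text runs and then cut only the text runs to length. This is a sketch based on an assumption about what the question intends, not something the function above implements. The function name splitLongPart is hypothetical, and $maxlen is assumed to be at least 1.

```php
<?php
// Hypothetical helper: break one over-long part into pieces without ever
// cutting inside a <...> tag. Assumes $maxlen >= 1.
function splitLongPart(string $part, int $maxlen): array {
    // PREG_SPLIT_DELIM_CAPTURE keeps each tag as its own token;
    // PREG_SPLIT_NO_EMPTY drops empty strings between adjacent tags.
    $tokens = preg_split('/(<[^>]*>)/', $part, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $pieces = [];
    foreach ($tokens as $token) {
        if ($token[0] === '<' && substr($token, -1) === '>') {
            // A tag stays intact, even if it is longer than $maxlen.
            $pieces[] = $token;
        } else {
            // Plain text is cut into chunks of at most $maxlen bytes.
            foreach (str_split($token, $maxlen) as $chunk) {
                $pieces[] = $chunk;
            }
        }
    }
    return $pieces;
}

print_r(splitLongPart('<b>abcdefgh</b> tail', 5));
```

Each tag ends up in its own array element, which is one reading of the requirement that leading and trailing tags go into separate parts. str_split counts bytes, so multibyte text would need mb_str_split instead.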

Example output (the full stop at each split boundary is not included in the parts):

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 8000);
array(1) {
  [0]=> string(114) "this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag. and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 80);
array(2) {
  [0]=> string(81) "this is a sentence. this is another sentence. this is an html <a href="a.b.c">tag"
  [1]=> string(32) " and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 40);
array(4) {
  [0]=> string(18) "this is a sentence"
  [1]=> string(25) " this is another sentence"
  [2]=> string(36) " this is an html <a href="a.b.c">tag"
  [3]=> string(32) " and the closing tag</a>. hooray"
}

splitStringUp("this is a sentence. this is another sentence. this is an html <a href=\"a.b.c\">tag. and the closing tag</a>. hooray", 0);
array(5) {
  [0]=> string(18) "this is a sentence"
  [1]=> string(25) " this is another sentence"
  [2]=> string(36) " this is an html <a href="a.b.c">tag"
  [3]=> string(24) " and the closing tag</a>"
  [4]=> string(7) " hooray"
}
