Wordwrap / Cut Text in HTML string

Problem description

Here is what I want to do: I have a string containing HTML tags, and I want to cut it using the wordwrap function, excluding the HTML tags.

I'm stuck:

public function textWrap($string, $width)
{
    $dom = new DOMDocument();
    $dom->loadHTML($string);
    foreach ($dom->getElementsByTagName('*') as $elem)
    {
        foreach ($elem->childNodes as $node)
        {
            if ($node->nodeType === XML_TEXT_NODE)
            {
                $text = trim($node->nodeValue);
                $length = mb_strlen($text);
                $width -= $length;
                if($width <= 0)
                { 
                    // Here, I would like to delete all next nodes
                    // and cut the current nodeValue and finally return the string 
                }
            }
        }
    }
}

I'm not sure I'm doing it the right way at the moment. I hope it's clear...

EDIT:

Here is an example. I have this text:

    <p>
        <span class="Underline"><span class="Bold">Test to be cut</span></span>
   </p><p>Some text</p>

Let's say I want to cut it at the 6th character; I would like to return this:

<p>
    <span class="Underline"><span class="Bold">Test to</span></span>
</p>

Solution

As I wrote in a comment, you first need to find the textual offset at which to make the cut.

First of all, I set up a DOMDocument containing the HTML fragment and then select the body element, which represents the fragment in the DOM:

$htmlFragment = <<<HTML
<p>
        <span class="Underline"><span class="Bold">Test to be cut</span></span>
   </p><p>Some text </p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($htmlFragment);
$parent = $dom->getElementsByTagName('body')->item(0);
if (!$parent)
{
    throw new Exception('Parent element not found.');
}

Then I use my TextRange class to find the place where the cut needs to be made, and I use the TextRange to actually make the cut and to locate the DOMNode that should become the last node of the fragment:

$range = new TextRange($parent);

// find the position at which to cut the HTML's textual representation
// by looking for the end of a word, or at least for whitespace,
// with a regular expression.
$width = 17;
$pattern = sprintf('~^.{0,%d}(?<=\S)(?=\s)|^.{0,%1$d}(?=\s)~su', $width);
$r = preg_match($pattern, $range, $matches);
if (FALSE === $r)
{
    throw new Exception('Wordcut regex failed.');
}
if (!$r)
{
    throw new Exception(sprintf('Text "%s" is not cut-able (should not happen).', $range));
}

This regular expression finds the offset at which to cut in the textual representation made available by $range. The pattern is inspired by another answer, which discusses it in more detail, and has been slightly modified to fit this answer's needs.
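To see what the pattern selects, it can be tried on a plain string first. This is a minimal sketch, assuming PHP 7.1 or later with PCRE's UTF-8 support; `findCutText()` is a hypothetical helper and not part of the original code:

```php
<?php
// Hypothetical helper for illustration only: returns the longest prefix of
// $text that is at most $width characters long and ends where the pattern
// allows a cut, or null when the pattern finds no cut position.
function findCutText(string $text, int $width): ?string
{
    // Same pattern as above; %1$d reuses the first sprintf() argument.
    $pattern = sprintf('~^.{0,%1$d}(?<=\S)(?=\s)|^.{0,%1$d}(?=\s)~su', $width);
    $r = preg_match($pattern, $text, $matches);
    if (false === $r) {
        throw new RuntimeException('Wordcut regex failed.');
    }
    return $r ? $matches[0] : null;
}

var_dump(findCutText('Test to be cut', 7)); // string(7) "Test to"
var_dump(findCutText('Test to be cut', 6)); // string(4) "Test"
var_dump(findCutText('Test', 10));          // NULL
```

The first alternative backs off to the last word end within the limit. The second alternative only matters when no word ends within the limit, for example when the text starts with whitespace. If the text contains no whitespace at all, nothing matches, which is the "not cut-able" case checked above.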

// split the text nodes at the offset so that a cut in the DOM becomes possible
$range->split($matches[0]);
$nodes = $range->getNodes();
$cutPosition = end($nodes);

Since it is possible that no text remains before the cut (in which case the body will become empty), I need to deal with that special case. Otherwise, as noted in the comment, all following nodes need to be removed:

// obtain list of elements to remove with xpath
if (FALSE === $cutPosition)
{
    // if there is no node, delete all children of the parent
    $cutPosition = $parent;
    $xpath = 'child::node()';
}
else
{
    $xpath = 'following::node()';
}

The rest is straightforward: query the XPath expression, remove the nodes and output the result:

// execute xpath
$xp = new DOMXPath($dom);
$remove = $xp->query($xpath, $cutPosition);
if (!$remove)
{
    throw new Exception('XPath query failed to obtain elements to remove');
}

// remove nodes
foreach($remove as $node)
{
    $node->parentNode->removeChild($node);
}

// inner HTML (PHP >= 5.3.6)
foreach($parent->childNodes as $node)
{
    echo $dom->saveHTML($node);
}
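Because the TextRange class itself is not reproduced here, the steps above can also be sketched with plain DOM calls. The following is a rough, self-contained approximation rather than the author's implementation: `cutHtmlAt()` is a hypothetical helper, it expects a non-negative character offset that has already been determined (for instance by the regular expression), and it assumes well-formed, ASCII-only markup and PHP 7 or later.

```php
<?php
// Sketch only: cutHtmlAt() is a hypothetical helper, not the TextRange class.
// It keeps the first $offset characters of text and drops everything after.
function cutHtmlAt(string $html, int $offset): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html); // assumes well-formed, ASCII-only markup
    $body = $dom->getElementsByTagName('body')->item(0);
    if (!$body) {
        throw new RuntimeException('Parent element not found.');
    }

    $xp = new DOMXPath($dom);
    $remaining = $offset;
    $last = null;

    // Walk the text nodes in document order until the offset is used up.
    foreach ($xp->query('.//text()', $body) as $text) {
        if ($remaining <= $text->length) {
            if ($remaining < $text->length) {
                // Split so that the tail becomes a separate, following node.
                $text->splitText($remaining);
            }
            $last = $text;
            break;
        }
        $remaining -= $text->length;
    }

    if (null !== $last) {
        // The XPath result is a static list, so removing while iterating is safe.
        foreach ($xp->query('following::node()', $last) as $node) {
            if ($node->parentNode) {
                $node->parentNode->removeChild($node);
            }
        }
    }

    // Serialize the children of body (inner HTML).
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo cutHtmlAt('<p><span class="Bold">Test to be cut</span></p><p>Some text</p>', 7);
```

The ancestors of the last kept text node are not on the `following` axis, so the enclosing `p` and `span` elements survive and stay properly closed. If the offset exceeds the total text length, the sketch returns the markup unchanged.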

The full code example, including the TextRange class, is available on viper codepad. The codepad has a bug, so its result is not correct (related: XPath query result order). The actual output is the following:

<p>
        <span class="Underline"><span class="Bold">Test to</span></span></p>

So make sure you have a current libxml version (which is normally the case). Also note that the output foreach at the end passes a node to the PHP function saveHTML, and that parameter is only available since PHP 5.3.6. If you don't have that PHP version, use an alternative such as the one outlined in "How to get the xml content of a node as a string?" or a similar question.
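One such alternative is to serialize each child with saveXML, which accepts a node argument. This is a minimal sketch; `innerXml()` is a hypothetical helper name, and the result is XML-style markup, so empty elements come out self-closed (for example `<br/>`):

```php
<?php
// Hypothetical helper: serializes the children of $parent without relying on
// the node argument of saveHTML() that was added in PHP 5.3.6.
function innerXml(DOMNode $parent)
{
    $out = '';
    foreach ($parent->childNodes as $child) {
        // saveXML() accepts a node argument; the result is XML-style markup.
        $out .= $parent->ownerDocument->saveXML($child);
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadHTML('<p>Hi <b>there</b></p>');
echo innerXml($dom->getElementsByTagName('body')->item(0));
```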

If you look closely at my example code, you might notice that the cut length is quite large ($width = 17;). That is because there are many whitespace characters in front of the text. This could be tweaked by making the regular expression drop any amount of whitespace in front of the text and/or by trimming the TextRange first. The second option needs more functionality, so I wrote something quick that can be used after creating the initial range:

...
$range = new TextRange($parent);
$trimmer = new TextRangeTrimmer($range);
$trimmer->trim();
...

That removes the needless whitespace on the left and right inside your HTML fragment. The TextRangeTrimmer code is the following:

class TextRangeTrimmer
{
    /**
     * @var TextRange
     */
    private $range;

    /**
     * @var array
     */
    private $charlist;

    public function __construct(TextRange $range, Array $charlist = NULL)
    {
        $this->range = $range;
        $this->setCharlist($charlist);      
    }
    /**
     * @param array $charlist list of UTF-8 encoded characters
     * @throws InvalidArgumentException
     */
    public function setCharlist(Array $charlist = NULL)
    {
        if (NULL === $charlist)
        {
            $charlist = str_split(" \t\n\r\0\x0B");
        }

        $list = array();

        foreach($charlist as $char)
        {
            if (!is_string($char))
            {
                throw new InvalidArgumentException('Not an Array of strings.');
            }
            if (strlen($char))
            {
                $list[] = $char; 
            }
        }

        $this->charlist = array_flip($list);
    }
    /**
     * @return array characters
     */
    public function getCharlist()
    {
        return array_keys($this->charlist);
    }
    public function trim()
    {
        if (!$this->charlist) return;
        $this->ltrim();
        $this->rtrim();
    }
    /**
     * number of consecutive characters of $charlist, counted from $start in $direction
     * 
     * @param array $charlist
     * @param int $start offset
     * @param int $direction 1: forward, -1: backward
     * @throws InvalidArgumentException
     */
    private function lengthOfCharacterSequence(Array $charlist, $start, $direction = 1)
    {
        $start = (int) $start;              
        $direction = max(-1, min(1, $direction));
        if (!$direction) throw new InvalidArgumentException('Direction must be 1 or -1.');

        $count = 0;
        for(;$char = $this->range->getCharacter($start), $char !== ''; $start += $direction, $count++)
            if (!isset($charlist[$char])) break;

        return $count;
    }
    public function ltrim()
    {
        $count = $this->lengthOfCharacterSequence($this->charlist, 0);

        if ($count)
        {
            $remainder = $this->range->split($count);
            foreach($this->range->getNodes() as $textNode)
            {
                $textNode->parentNode->removeChild($textNode);
            }
            $this->range->setNodes($remainder->getNodes());
        }

    }
    public function rtrim()
    {
        $count = $this->lengthOfCharacterSequence($this->charlist, -1, -1);

        if ($count)
        {
            $chop = $this->range->split(-$count);
            foreach($chop->getNodes() as $textNode)
            {
                $textNode->parentNode->removeChild($textNode);
            }
        }
    }
}
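For readers who do not have the TextRange class at hand, a similar effect can be approximated directly on the DOM text nodes. The sketch below uses a different, simpler technique than the class above: `trimDomText()` is a hypothetical function, it relies on PHP's byte-based ltrim and rtrim (so it only suits single-byte trim characters), and it assumes PHP 7.1 or later and that `$parent` is an element that belongs to a document.

```php
<?php
// Sketch only: trimDomText() is a hypothetical stand-in for TextRangeTrimmer.
// It strips the given bytes from the start and end of the text below $parent.
function trimDomText(DOMNode $parent, string $chars = " \t\n\r\0\x0B"): void
{
    $xp = new DOMXPath($parent->ownerDocument);
    $nodes = iterator_to_array($xp->query('.//text()', $parent), false);

    // Pass 0 trims from the left, pass 1 from the right.
    foreach ([$nodes, array_reverse($nodes)] as $pass => $list) {
        foreach ($list as $text) {
            if (null === $text->parentNode) {
                continue; // already removed by the first pass
            }
            $trimmed = 0 === $pass
                ? ltrim($text->data, $chars)
                : rtrim($text->data, $chars);
            if ('' === $trimmed) {
                // Nothing but trimmable characters: drop the node and go on.
                $text->parentNode->removeChild($text);
                continue;
            }
            $text->data = $trimmed;
            break; // reached real text, this side is done
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML("<p>\n    <b>  Hi there </b>\n</p>");
$body = $dom->getElementsByTagName('body')->item(0);
trimDomText($body);
echo $body->textContent; // Hi there
```

With the leading whitespace gone, a much smaller cut width can be used than the 17 needed in the example above.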

Hope this is helpful.
