Truncate text containing HTML, ignoring tags


Question


I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags are counted toward the length and less visible text is returned. This can also leave tags unclosed or only partially closed (so Tidy may not work properly, and there is still less content). How can I truncate based on the text alone (probably stopping at a table, as that could cause more complex issues)?

substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

Would result in:

Hello, my <strong>name</st...

What I would want is:

Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

How can I do this?

While my question is about how to do it in PHP, it would be good to know how to do it in C# as well. Either should be OK, as I think I would be able to port the method over (unless it is a built-in method).

Also note that I have included an HTML entity &acute; - which would have to be considered as a single character (rather than 7 characters as in this example).

strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.

Solution

Assuming you are using valid XHTML, it's simple to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".

<?php
header('Content-type: text/plain; charset=utf-8');

function printTruncated($maxLength, $html, $isUtf8=true)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

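    // For UTF-8, we need to count multibyte sequences as one character.
    // Tag names may contain digits after the first letter (e.g. h1), so match those too.
    $re = $isUtf8
        ? '{</?([a-z][a-z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}'
        : '{</?([a-z][a-z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;}';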

    while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);
        if ($printedLength >= $maxLength) break;

        if ($tag[0] == '&' || ord($tag) >= 0x80)
        {
            // Pass the entity or UTF-8 multibyte sequence through unchanged.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}


printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");

printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>"); print("\n");
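
The function above prints its result directly, whereas the question builds a string with substr(...) . "...". One way to get a string back without touching the function is output buffering. The following is only a minimal sketch: captureOutput is a hypothetical helper name that is not part of the original answer, and it assumes the wrapped function writes with print/echo/printf.

```php
<?php
// Hypothetical helper (not part of the original answer): runs any function
// that prints and returns what it printed as a string.
function captureOutput(callable $printer, ...$args): string
{
    ob_start();                                // buffer output instead of sending it
    try {
        $printer(...$args);
    } finally {
        $captured = (string) ob_get_clean();   // stop buffering, even if $printer throws
    }
    return $captured;
}

// With printTruncated() from above defined, usage would look like:
// $teaser = captureOutput('printTruncated', 26, $html) . '...';
```

Buffering keeps the truncation logic unchanged; rewriting the function to append to a string and return it would work just as well.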

Encoding note: The above code assumes the XHTML is UTF-8 encoded. ASCII-compatible single-byte encodings (such as Latin-1) are also supported; just pass false as the third argument. Other multibyte encodings are not supported, though you may hack in support by using mb_convert_encoding to convert to UTF-8 before calling the function, then converting back again in every print statement.

(You should always be using UTF-8, though.)
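
The conversion hack from the encoding note could be sketched roughly as follows. Instead of converting back in every print statement, this variant buffers the output and converts it once at the end, which is a substitution on my part rather than what the note literally describes. truncateInEncoding and its $truncate parameter are hypothetical names, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical wrapper (not part of the original answer). $truncate is any
// function with the signature ($maxLength, $html) that prints UTF-8 output,
// for example printTruncated() above.
function truncateInEncoding(callable $truncate, int $maxLength, string $html, string $encoding): string
{
    // Convert the input to UTF-8 so multibyte sequences are counted correctly.
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    // Buffer everything the truncation function prints.
    ob_start();
    try {
        $truncate($maxLength, $utf8);
    } finally {
        $printed = (string) ob_get_clean();
    }

    // Convert the truncated result back to the caller's encoding.
    return mb_convert_encoding($printed, $encoding, 'UTF-8');
}

// With printTruncated() defined, usage would look like:
// $out = truncateInEncoding('printTruncated', 10, $html, 'EUC-JP');
```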

Edit: Updated to handle character entities and UTF-8. Fixed bug where the function would print one character too many, if that character was a character entity.
