C#: Truncate HTML safely for article summary

Views: 158 | Published: 2016/9/8 17:21:02 | Tags: c# html regex

Question

Does anyone have a c# variation of this?

This is so I can take some html and display it without breaking as a summary lead in to an article?

Truncate text containing HTML, ignoring tags

Save me from reinventing the wheel!

Edit

Sorry, new here, and you're right, I should have phrased the question better. Here's a bit more info.

I wish to take a html string and truncate it to a set number of words (or even char length) so I can then show the start of it as a summary (which then leads to the main article). I wish to preserve the html so I can show the links etc in preview.

The main issue I have to solve is the fact that we may well end up with unclosed html tags if we truncate in the middle of 1 or more tags!

The idea I have for a solution is to:

truncate the HTML to N words (words are better, but chars are OK) first, being sure not to stop in the middle of a tag and truncate a required attribute
work through the opened HTML tags in this truncated string (maybe push them on a stack as I go?)
then work through the closing tags and ensure they match the ones on the stack as I pop them off
if any open tags are left on the stack after this, write their closing tags to the end of the truncated string, and the HTML should be good to go!
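
The four steps above can also be folded into a single pass over the string. The following is only a minimal sketch, not the code from this thread: the class name `HtmlSummarySketch`, the method `TruncateChars` and the `VoidTags` list are made up for illustration. It assumes well-formed HTML, counts an entity such as `&amp;` as several characters, and does not handle comments, scripts or a `>` inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class HtmlSummarySketch
{
    // Hypothetical list of tags that never get a closing tag,
    // so they must not be pushed on the stack.
    private static readonly HashSet<string> VoidTags = new HashSet<string>(
        StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    public static string TruncateChars(string html, int maxChars)
    {
        var output = new StringBuilder();
        var openTags = new Stack<string>();
        int visible = 0;   // characters counted outside of tags
        int i = 0;

        while (i < html.Length && visible < maxChars)
        {
            if (html[i] == '<')
            {
                // Copy the whole tag at once, so we never stop inside it.
                int end = html.IndexOf('>', i);
                if (end < 0) break;   // unterminated tag: drop it
                string tag = html.Substring(i, end - i + 1);
                string name = tag.Trim('<', '>', '/').Split(' ', '\t', '\r', '\n')[0];

                bool isClosing = tag.StartsWith("</", StringComparison.Ordinal);
                bool isSelfClosing = tag.EndsWith("/>", StringComparison.Ordinal)
                                     || VoidTags.Contains(name);

                if (isClosing)
                {
                    // Pop only when the closing tag matches the innermost open tag.
                    if (openTags.Count > 0 &&
                        string.Equals(openTags.Peek(), name, StringComparison.OrdinalIgnoreCase))
                    {
                        openTags.Pop();
                    }
                }
                else if (!isSelfClosing)
                {
                    openTags.Push(name);
                }

                output.Append(tag);
                i = end + 1;
            }
            else
            {
                output.Append(html[i]);
                visible++;
                i++;
            }
        }

        // Whatever is still on the stack was opened but not closed: close it,
        // innermost element first.
        while (openTags.Count > 0)
        {
            output.Append("</").Append(openTags.Pop()).Append('>');
        }

        return output.ToString();
    }
}
```

Calling `HtmlSummarySketch.TruncateChars(html, 12)` on the first input from the unit tests below should give the same string that the test expects. Keeping the stack up to date while scanning avoids the second regex pass over the truncated string.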

Edit 12/11/2009

Here is what I have bumbled together so far as a unit test file in VS2008; this may help someone in future.
My hack attempts based on Jan's code are at the top: a char version and a word version (DISCLAIMER: this is dirty, rough code on my part!).
I assume well-formed HTML in all cases (but not necessarily a full document with a root node, as the XML version requires).
Abel's XML version is at the bottom, but I have not yet got round to getting the tests to run fully on it (plus I need to understand the code) ...
I will update when I get a chance to refine it.
Having trouble posting code: is there no upload facility on Stack?

Thanks for all comments :)

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.XPath;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace PINET40TestProject
{
    [TestClass]
    public class UtilityUnitTest
    {
        public static string TruncateHTMLSafeishChar(string text, int charCount)
        {
            bool inTag = false;
            int cntr = 0;
            int cntrContent = 0;

            // loop through html, counting only viewable content
            foreach (Char c in text)
            {
                if (cntrContent == charCount) break;
                cntr++;
                if (c == '<')
                {
                    inTag = true;
                    continue;
                }

                if (c == '>')
                {
                    inTag = false;
                    continue;
                }
                if (!inTag) cntrContent++;
            }

            string substr = text.Substring(0, cntr);

            //search for nonclosed tags        
            MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
            MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

            // create stack          
            Stack<string> opentagsStack = new Stack<string>();
            Stack<string> closedtagsStack = new Stack<string>();

            // to be honest, this seemed like a good idea then I got lost along the way 
            // so logic is probably hanging by a thread!! 
            foreach (Match tag in openedTags)
            {
                string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                // strip any attributes, sure we can use regex for this!
                if (openedtag.IndexOf(" ") >= 0)
                {
                    openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                }

                // ignore brs as self-closed
                if (openedtag.Trim() != "br")
                {
                    opentagsStack.Push(openedtag);
                }
            }

            foreach (Match tag in closedTags)
            {
                string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                closedtagsStack.Push(closedtag);
            }

            if (closedtagsStack.Count < opentagsStack.Count)
            {
                while (opentagsStack.Count > 0)
                {
                    string tagstr = opentagsStack.Pop();

                    if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                    {
                        substr += "</" + tagstr + ">";
                    }
                    else
                    {
                        closedtagsStack.Pop();
                    }
                }
            }

            return substr;
        }

        public static string TruncateHTMLSafeishWord(string text, int wordCount)
        {
            bool inTag = false;
            int cntr = 0;
            int cntrWords = 0;
            Char lastc = ' ';

            // loop through html, counting only viewable content
            foreach (Char c in text)
            {
                if (cntrWords == wordCount) break;
                cntr++;
                if (c == '<')
                {
                    inTag = true;
                    continue;
                }

                if (c == '>')
                {
                    inTag = false;
                    continue;
                }
                if (!inTag)
                {
                    // a space outside a tag ends a word; consecutive spaces count only once
                    if (c == 32 && lastc != 32)
                        cntrWords++;
                    lastc = c; // remember the previous visible character, otherwise no word is ever counted
                }
            }

            string substr = text.Substring(0, cntr) + " ...";

            //search for nonclosed tags        
            MatchCollection openedTags = new Regex("<[^/](.|\n)*?>").Matches(substr);
            MatchCollection closedTags = new Regex("<[/](.|\n)*?>").Matches(substr);

            // create stack          
            Stack<string> opentagsStack = new Stack<string>();
            Stack<string> closedtagsStack = new Stack<string>();

            foreach (Match tag in openedTags)
            {
                string openedtag = tag.Value.Substring(1, tag.Value.Length - 2);
                // strip any attributes, sure we can use regex for this!
                if (openedtag.IndexOf(" ") >= 0)
                {
                    openedtag = openedtag.Substring(0, openedtag.IndexOf(" "));
                }

                // ignore brs as self-closed
                if (openedtag.Trim() != "br")
                {
                    opentagsStack.Push(openedtag);
                }
            }

            foreach (Match tag in closedTags)
            {
                string closedtag = tag.Value.Substring(2, tag.Value.Length - 3);
                closedtagsStack.Push(closedtag);
            }

            if (closedtagsStack.Count < opentagsStack.Count)
            {
                while (opentagsStack.Count > 0)
                {
                    string tagstr = opentagsStack.Pop();

                    if (closedtagsStack.Count == 0 || tagstr != closedtagsStack.Peek())
                    {
                        substr += "</" + tagstr + ">";
                    }
                    else
                    {
                        closedtagsStack.Pop();
                    }
                }
            }

            return substr;
        }

        public static string TruncateHTMLSafeishCharXML(string text, int charCount)
        {
            // your data, probably comes from somewhere, or as params to a method
            XmlDocument xml = new XmlDocument();
            xml.LoadXml(text);
            // create a navigator, this is our primary tool
            XPathNavigator navigator = xml.CreateNavigator();
            XPathNavigator breakPoint = null;

            // find the text node we need:
            while (navigator.MoveToFollowing(XPathNodeType.Text))
            {
                string lastText = navigator.Value.Substring(0, Math.Min(charCount, navigator.Value.Length));
                charCount -= navigator.Value.Length;
                if (charCount <= 0)
                {
                    // truncate the last text. Here goes your "search word boundary" code:        
                    navigator.SetValue(lastText);
                    breakPoint = navigator.Clone();
                    break;
                }
            }

            // first remove text nodes, because Microsoft unfortunately merges them without asking
            while (navigator.MoveToFollowing(XPathNodeType.Text))
            {
                if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                {
                    navigator.DeleteSelf();
                }
            }

            // then remove the remaining elements (DeleteSelf moves the navigator to the parent)
            navigator.MoveTo(breakPoint);
            while (navigator.MoveToFollowing(XPathNodeType.Element))
            {
                if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
                {
                    navigator.DeleteSelf();
                }
            }

            // then remove *all* empty nodes to clean up (optional):
            // TODO, add empty elements like <br />, <img /> as exclusion
            navigator.MoveToRoot();
            while (navigator.MoveToFollowing(XPathNodeType.Element))
            {
                while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
                {
                    navigator.DeleteSelf();
                }
            }

            // moves to parent
            navigator.MoveToRoot();
            return navigator.InnerXml;
        }

        [TestMethod]
        public void TestTruncateHTMLSafeish()
        {
            // Case where we just make it to start of HREF (so effectively an empty link)

            // 'simple' nested none attributed tags
            Assert.AreEqual(@"<h1>1234</h1><b><i>56789</i>012</b>",
            TruncateHTMLSafeishChar(
                @"<h1>1234</h1><b><i>56789</i>012345</b>",
                12));

            // In middle of a!
            Assert.AreEqual(@"<h1>1234</h1><a href=""testurl""><b>567</b></a>",
            TruncateHTMLSafeishChar(
                @"<h1>1234</h1><a href=""testurl""><b>5678</b></a><i><strong>some italic nested in string</strong></i>",
                7));

            // more
            Assert.AreEqual(@"<div><b><i><strong>1</strong></i></b></div>",
            TruncateHTMLSafeishChar(
                @"<div><b><i><strong>12</strong></i></b></div>",
                1));

            // br
            Assert.AreEqual(@"<h1>1 3 5</h1><br />6",
            TruncateHTMLSafeishChar(
                @"<h1>1 3 5</h1><br />678<br />",
                6));
        }

        [TestMethod]
        public void TestTruncateHTMLSafeishWord()
        {
            // zero case
            Assert.AreEqual(@" ...",
                            TruncateHTMLSafeishWord(
                                @"",
                               5));

            // 'simple' nested none attributed tags
            Assert.AreEqual(@"<h1>one two <br /></h1><b><i>three  ...</i></b>",
            TruncateHTMLSafeishWord(
                @"<h1>one two <br /></h1><b><i>three </i>four</b>",
                3), "we have added ' ...' to end of summary");

            // In middle of a!
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i>",
                4));

            // start of h1
            Assert.AreEqual(@"<h1>one two three  ...</h1>",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                3));

            // more than words available
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
            TruncateHTMLSafeishWord(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                99));
        }

        [TestMethod]
        public void TestTruncateHTMLSafeishWordXML()
        {
            // zero case
            Assert.AreEqual(@" ...",
                            TruncateHTMLSafeishWord(
                                @"",
                               5));

            // 'simple' nested none attributed tags
            string output = TruncateHTMLSafeishCharXML(
                @"<body><h1>one two </h1><b><i>three </i>four</b></body>",
                13);
            Assert.AreEqual(@"<body>\r\n  <h1>one two </h1>\r\n  <b>\r\n    <i>three</i>\r\n  </b>\r\n</body>", output,
             "XML version, no ... yet and addeds '\r\n  + spaces?' to format document");

            // In middle of a!
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four  ...</b></a>",
            TruncateHTMLSafeishCharXML(
                @"<body><h1>one two three </h1><a href=""testurl""><b class=""mrclass"">four five </b></a><i><strong>some italic nested in string</strong></i></body>",
                4));

            // start of h1
            Assert.AreEqual(@"<h1>one two three  ...</h1>",
            TruncateHTMLSafeishCharXML(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                3));

            // more than words available
            Assert.AreEqual(@"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i> ...",
            TruncateHTMLSafeishCharXML(
                @"<h1>one two three </h1><a href=""testurl""><b>four five </b></a><i><strong>some italic nested in string</strong></i>",
                99));
        }
    }
}

Solution

EDIT: See below for a full solution. This first attempt strips the HTML; the second does not.

Let's summarize what you want:

No HTML in the result
It should take any valid data inside <body>
It has a fixed maximum length

If your HTML is XHTML, this becomes trivial (I haven't seen the PHP solution and I doubt very much that they use a similar approach, but I believe this one is understandable and rather easy):

using System.Diagnostics;
using System.Xml;

XmlDocument xml = new XmlDocument();

// replace the following line with the content of your full XHTML
xml.LoadXml(@"<body><p>some <i>text</i>here</p><div>that needs stripping</div></body>");

// get all text nodes under <body> (the double "//" is on purpose)
XmlNodeList nodes = xml.SelectNodes("//body//text()");

// loop through the text nodes; replace this with whatever you like to do with the text
foreach (XmlNode node in nodes)
{
    Debug.WriteLine(node.Value);
}

Note: spaces etc will be preserved. This is usually a good thing.
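
The loop above only prints the text nodes; the "fixed maximum length" requirement could be met by concatenating them until a character budget runs out. This is a rough sketch under that assumption. The names `PlainTextSummarySketch` and `FromXhtml` are invented for illustration, and the input must be well-formed XHTML with a `<body>` element.

```csharp
using System;
using System.Text;
using System.Xml;

public static class PlainTextSummarySketch
{
    // Returns the text under <body>, without any markup,
    // cut off after maxLength characters.
    public static string FromXhtml(string xhtml, int maxLength)
    {
        var xml = new XmlDocument();
        xml.LoadXml(xhtml);

        var summary = new StringBuilder();
        foreach (XmlNode node in xml.SelectNodes("//body//text()"))
        {
            int remaining = maxLength - summary.Length;
            if (remaining <= 0) break;   // budget used up

            string value = node.Value ?? "";
            summary.Append(value.Length <= remaining
                ? value
                : value.Substring(0, remaining));
        }

        return summary.ToString();
    }
}
```

Because the text comes from parsed nodes, an entity such as `&amp;` arrives already decoded and counts as a single character.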

If you don't have XHTML, you can use the HTML Agility Pack, which lets you do about the same for plain old HTML (it internally converts it to some DOM). I haven't tried it, but it should run rather smoothly.

BIG EDIT:

Actual solution

In a little comment I promised to take the XHTML / XmlDocument approach and use it for a type-safe method that splits your HTML based on text length while keeping the HTML markup. Given the following HTML, the code breaks it correctly in the middle of the word "needs", removes the rest, removes empty nodes and automatically closes any open elements.

The sample HTML:

<body>
    <p><tt>some<u><i>text</i>here</u></tt></p>
    <div>that <b><i>needs <span>str</span>ip</i></b><s>ping</s></div>
</body>

The code, tested and working with any kind of input (OK, granted, I only did some tests and the code may contain bugs; let me know if you find them!).

// your data, probably comes from somewhere, or as params to a method
int lengthAvailable = 20;
XmlDocument xml = new XmlDocument();
xml.LoadXml(@"place-html-code-here-left-out-for-brevity");

// create a navigator, this is our primary tool
XPathNavigator navigator = xml.CreateNavigator();
XPathNavigator breakPoint = null;


string lastText = "";

// find the text node we need:
while (navigator.MoveToFollowing(XPathNodeType.Text))
{
    lastText = navigator.Value.Substring(0, Math.Min(lengthAvailable, navigator.Value.Length));
    lengthAvailable -= navigator.Value.Length;

    if (lengthAvailable <= 0)
    {
        // truncate the last text. Here goes your "search word boundary" code:
        navigator.SetValue(lastText);
        breakPoint = navigator.Clone();
        break;
    }
}

// first remove text nodes, because Microsoft unfortunately merges them without asking
while (navigator.MoveToFollowing(XPathNodeType.Text))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then remove the rest
navigator.MoveTo(breakPoint);
while (navigator.MoveToFollowing(XPathNodeType.Element))
    if (navigator.ComparePosition(breakPoint) == XmlNodeOrder.After)
        navigator.DeleteSelf();   // moves to parent

// then remove *all* empty nodes to clean up (not necessary): 
// TODO, add empty elements like <br />, <img /> as exclusion
navigator.MoveToRoot();
while (navigator.MoveToFollowing(XPathNodeType.Element))
    while (!navigator.HasChildren && (navigator.Value ?? "").Trim() == "")
        navigator.DeleteSelf();  // moves to parent

navigator.MoveToRoot();
Debug.WriteLine(navigator.InnerXml);

How the code works

The code does the following things, in that order:

It goes through all text nodes until the accumulated text size exceeds the allowed limit, at which point it truncates that node. Because it works on parsed text nodes, this automatically treats entities such as &gt; as one character.
It then shortens the text of the "breaking node" and sets it back on the node. It clones the XPathNavigator at this point, as we need to remember this "breaking point".
To work around an MS bug (an ancient one, actually), we first have to remove any remaining text nodes that follow the breaking point; otherwise we risk auto-merging of text nodes when they end up as siblings of each other. Note: DeleteSelf is handy, but it moves the navigator position to its parent, which is why we need to check the current position against the "breaking point" position remembered in the previous step.
Then we do what we wanted to do in the first place: remove any node following the breaking point.
Not a necessary step: cleaning up the code and removing any empty elements. This action is merely to clean up the HTML and/or to filter for specific (dis)allowed elements. It can be left out.
Go back to "root" and get the content as a string with InnerXml.
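
The code above leaves the "search word boundary" part open and simply cuts the breaking node at the character limit. One possible way to fill that gap is sketched below; the helper `TrimToWordBoundary` and its class are hypothetical and not part of the original answer.

```csharp
using System;

public static class WordBoundarySketch
{
    // Returns at most maxLength characters of text, backing up to the last
    // whitespace so that a word is not cut in half. If the first word alone
    // is longer than maxLength, it falls back to a hard cut.
    public static string TrimToWordBoundary(string text, int maxLength)
    {
        if (maxLength <= 0) return "";
        if (text.Length <= maxLength) return text;

        // The cut already falls between two words.
        if (char.IsWhiteSpace(text[maxLength])) return text.Substring(0, maxLength);

        // Search backwards from the last character we are allowed to keep.
        int lastSpace = text.LastIndexOfAny(new[] { ' ', '\t', '\r', '\n' }, maxLength - 1);
        return lastSpace > 0
            ? text.Substring(0, lastSpace)
            : text.Substring(0, maxLength);
    }
}
```

In the first loop, lastText could then be computed with `WordBoundarySketch.TrimToWordBoundary(navigator.Value, lengthAvailable)` instead of the plain Substring call, as long as that happens before lengthAvailable is decremented.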

That's all, rather simple, though it may look a bit daunting at first sight.

PS: the same would be way easier to read and understand were you to use XSLT, which is the ideal tool for this type of job.

Update: added extended code sample, based on edited question
Update: added a bit of explanation
