Split an HTML string into N parts


Question

Does anybody have an example of taking an HTML string (coming from a TinyMCE editor) and splitting it into N parts using C#?

I need to split the string evenly without splitting words.

I was thinking of just splitting the HTML and using the HtmlAgilityPack to try to fix the broken tags. I'm not sure how to find the split points, though, because ideally they should be based purely on the text rather than on the HTML as well.

Anybody got any ideas on how to go about this?

Update

As requested, here is an example of input and desired output.

Input:

<p><strong>Lorem ipsum dolor sit amet, <em>consectetur adipiscing</em></strong> elit.</p>



Output (when split into 3 columns):

Part1: <p><strong>Lorem ipsum dolor</strong></p>
Part2: <p><strong>sit amet, <em>consectetur</em></strong></p>
Part3: <p><strong><em>adipiscing</em></strong> elit.</p>



Update 2

I've just had a play with Tidy HTML, and it seems to work well at fixing broken tags, so this may be a good option if I can find a way to locate the split points.

Update 3

Using a method similar to the one in http://stackoverflow.com/questions/1613896/truncate-string-on-whole-words-in-net-c, I've now managed to get a list of the plain-text words that will make up each part. So, assuming Tidy HTML gives me a valid XML structure for the HTML, and given this list of words, does anybody have an idea of the best way to split it?

Update 4

Can anybody see an issue with using a regex to find the indices within the HTML in the following way?

Given the plain-text string "sit amet, consectetur", replace every space with the regex "(\s|<(.|\n)+?>)*". In theory, the resulting pattern finds that string in the HTML with any combination of spaces and/or tags between the words.

I could then just use Tidy HTML to fix the broken HTML tags?
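The regex idea from Update 4 can be sketched as follows. This is only an illustration, not a tested solution: `TagTolerantSearch` and `BuildPattern` are hypothetical names, and the gap pattern uses `<[^>]+>` in place of the `<(.|\n)+?>` quoted above, on the assumption that no attribute value contains a literal `>`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagTolerantSearch {
    // Matches any run of whitespace and/or tags that may sit between two words.
    private const string Gap = @"(?:\s|<[^>]+>)*";

    // Builds a pattern that finds the given plain text inside HTML even when
    // tags or extra whitespace separate its words.
    public static Regex BuildPattern(string plainText) {
        string[] words = plainText.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        // Escape each word so punctuation is matched literally.
        return new Regex(string.Join(Gap, words.Select(w => Regex.Escape(w))));
    }

    public static void Main() {
        const string html =
            "<p><strong>Lorem ipsum dolor sit amet, <em>consectetur adipiscing</em></strong> elit.</p>";
        Match match = BuildPattern("sit amet, consectetur").Match(html);
        // Index and Length give the span of the part within the HTML string.
        Console.WriteLine(match.Index + ", " + match.Length + ": " + match.Value);
    }
}
```

One likely issue with this approach: the plain text and the HTML have to agree character for character, so entities such as `&amp;` in the HTML would need to be decoded or handled separately before matching.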

Many thanks

Recommended answer

A Proposed Solution

Man, this is a curse of mine! I apparently cannot walk away from a problem without spending up-to-and-including an unreasonable amount of time on it.

I thought about this. I thought about HTML Tidy, and maybe it would work, but I had trouble wrapping my head around it.

So, I wrote my own solution.

I tested this on your input and on some other input that I threw together myself. It seems to work pretty well. Surely there are holes in it, but it might provide you with a starting point.

Anyway, my approach was this:


  1. Encapsulate the notion of a single word in an HTML document using a class that includes information about that word's position in the HTML document hierarchy, up to a given "top". This I have implemented in the HtmlWord class below.
  2. Create a class that is capable of writing a single line composed of these HTML words above, such that start-element and end-element tags are added in the appropriate places. This I have implemented in the HtmlLine class below.
  3. Write a few extension methods to make these classes immediately and intuitively accessible straight from an HtmlAgilityPack.HtmlNode object. These I have implemented in the HtmlHelper class below.
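Step 2 hinges on working out, between two consecutive words, which elements have to be closed and which have to be opened. Here is a minimal sketch of that bookkeeping, assuming each word's ancestry is reduced to a list of tag names, outermost first. `TagPathDiff` and its methods are hypothetical names and are not part of the classes below, which compare HtmlNode instances rather than names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagPathDiff {
    // Number of leading entries the two ancestor paths have in common.
    public static int CommonPrefixLength(IList<string> previous, IList<string> next) {
        int i = 0;
        while (i < previous.Count && i < next.Count && previous[i] == next[i])
            i++;
        return i;
    }

    // Tags to close when moving from one word to the next, innermost first.
    public static List<string> TagsToClose(IList<string> previous, IList<string> next) {
        return previous.Skip(CommonPrefixLength(previous, next)).Reverse().ToList();
    }

    // Tags to open when moving from one word to the next, outermost first.
    public static List<string> TagsToOpen(IList<string> previous, IList<string> next) {
        return next.Skip(CommonPrefixLength(previous, next)).ToList();
    }
}
```

For the sample input, moving from "adipiscing" (p, strong, em) to "elit." (p) closes em and then strong and opens nothing. Comparing names alone cannot tell two adjacent sibling elements with the same tag apart, which is one reason to compare the nodes themselves.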

Am I crazy for doing all this? Probably, yes. But, you know, if you can't figure out any other way, you can give this a try.

Here's how it works with your sample input:

var document = new HtmlDocument();
document.LoadHtml("<p><strong>Lorem ipsum dolor sit amet, <em>consectetur adipiscing</em></strong> elit.</p>");

var nodeToSplit = document.DocumentNode.SelectSingleNode("p");
var lines = nodeToSplit.SplitIntoLines(3);

foreach (var line in lines)
    Console.WriteLine(line.ToString());



Output:

<p><strong>Lorem ipsum dolor </strong></p>
<p><strong>sit amet, <em>consectetur </em></strong></p>
<p><strong><em>adipiscing </em></strong>elit. </p>
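Note that the argument passed to SplitIntoLines is the number of words per line, not the number of parts; the sample has 8 words, so 3 words per line happens to give 3 lines. If the goal is a fixed number of parts, the words-per-line value can be derived first. A small sketch, where `SplitMath.WordsPerPart` is a hypothetical helper and not part of the classes in this answer:

```csharp
using System;

public static class SplitMath {
    // Smallest words-per-line value that fits totalWords into at most `parts` lines.
    public static int WordsPerPart(int totalWords, int parts) {
        if (parts < 1)
            throw new ArgumentOutOfRangeException("parts");
        // Integer ceiling division; never return less than one word per line.
        return Math.Max(1, (totalWords + parts - 1) / parts);
    }
}
```

With the classes in this answer, the call would look something like `nodeToSplit.SplitIntoLines(SplitMath.WordsPerPart(nodeToSplit.GetWords(nodeToSplit.ParentNode).Count, 3))`. Rounding up can produce fewer lines than requested: 4 words in 3 parts gives 2 words per line and therefore only 2 lines.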

And now the code:

HtmlWord class

using System;
using System.Collections.Generic;
using System.Linq;

using HtmlAgilityPack;

public class HtmlWord {
    public string Text { get; private set; }
    public HtmlNode[] NodeStack { get; private set; }

    // convenience property to display list of ancestors cleanly
    // (for ease of debugging)
    public string NodeList {
        get { return string.Join(", ", NodeStack.Select(n => n.Name).ToArray()); }
    }

    internal HtmlWord(string text, HtmlNode node, HtmlNode top) {
        Text = text;
        NodeStack = GetNodeStack(node, top);
    }

    private static HtmlNode[] GetNodeStack(HtmlNode node, HtmlNode top) {
        var nodes = new Stack<HtmlNode>();

        while (node != null && !node.Equals(top)) {
            nodes.Push(node);
            node = node.ParentNode;
        }

        return nodes.ToArray();
    }
}



HtmlLine class

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

using HtmlAgilityPack;

[Flags()]
public enum NodeChange {
    None = 0,
    Dropped = 1,
    Added = 2
}

public class HtmlLine {
    private List<HtmlWord> _words;
    public IList<HtmlWord> Words {
        get { return _words.AsReadOnly(); }
    }

    public int WordCount {
        get { return _words.Count; }
    }

    public HtmlLine(IEnumerable<HtmlWord> words) {
        _words = new List<HtmlWord>(words);
    }

    private static NodeChange CompareNodeStacks(HtmlWord x, HtmlWord y, out HtmlNode[] droppedNodes, out HtmlNode[] addedNodes) {
        var droppedList = new List<HtmlNode>();
        var addedList = new List<HtmlNode>();

        // traverse x's NodeStack backwards (innermost first) to find nodes
        // that are not in y's NodeStack (and are therefore "finished")
        foreach (var node in x.NodeStack.Reverse()) {
            if (!Array.Exists(y.NodeStack, n => n.Equals(node)))
                droppedList.Add(node);
        }

        // traverse y's NodeStack forwards (outermost first) to find nodes
        // that are not in x's NodeStack (and are therefore "new")
        foreach (var node in y.NodeStack) {
            if (!Array.Exists(x.NodeStack, n => n.Equals(node)))
                addedList.Add(node);
        }

        droppedNodes = droppedList.ToArray();
        addedNodes = addedList.ToArray();

        // combine flags with |= (using &= on None would always leave None)
        NodeChange change = NodeChange.None;
        if (droppedNodes.Length > 0)
            change |= NodeChange.Dropped;
        if (addedNodes.Length > 0)
            change |= NodeChange.Added;

        // could maybe use this in some later revision?
        // not worth the effort right now...
        return change;
    }

    public override string ToString() {
        if (WordCount < 1)
            return string.Empty;

        var lineBuilder = new StringBuilder();

        using (var lineWriter = new StringWriter(lineBuilder))
        using (var xmlWriter = new XmlTextWriter(lineWriter)) {
            var firstWord = _words[0];
            foreach (var node in firstWord.NodeStack) {
                xmlWriter.WriteStartElement(node.Name);
                foreach (var attr in node.Attributes)
                    xmlWriter.WriteAttributeString(attr.Name, attr.Value);
            }
            xmlWriter.WriteString(firstWord.Text + " ");

            for (int i = 1; i < WordCount; ++i) {
                var previousWord = _words[i - 1];
                var word = _words[i];

                HtmlNode[] droppedNodes;
                HtmlNode[] addedNodes;

                CompareNodeStacks(
                    previousWord,
                    word,
                    out droppedNodes,
                    out addedNodes
                );

                foreach (var dropped in droppedNodes)
                    xmlWriter.WriteEndElement();
                foreach (var added in addedNodes) {
                    xmlWriter.WriteStartElement(added.Name);
                    foreach (var attr in added.Attributes)
                        xmlWriter.WriteAttributeString(attr.Name, attr.Value);
                }

                xmlWriter.WriteString(word.Text + " ");

                if (i == _words.Count - 1) {
                    foreach (var node in word.NodeStack)
                        xmlWriter.WriteEndElement();
                }
            }
        }

        return lineBuilder.ToString();
    }
}



HtmlHelper static class

using System;
using System.Collections.Generic;
using System.Linq;

using HtmlAgilityPack;

public static class HtmlHelper {
    public static IList<HtmlLine> SplitIntoLines(this HtmlNode node, int wordsPerLine) {
        var lines = new List<HtmlLine>();

        var words = node.GetWords(node.ParentNode);

        for (int i = 0; i < words.Count; i += wordsPerLine) {
            lines.Add(new HtmlLine(words.Skip(i).Take(wordsPerLine)));
        }

        return lines.AsReadOnly();
    }

    public static IList<HtmlWord> GetWords(this HtmlNode node, HtmlNode top) {
        var words = new List<HtmlWord>();

        if (node.HasChildNodes) {
            foreach (var child in node.ChildNodes)
                words.AddRange(child.GetWords(top));
        } else {
            var textNode = node as HtmlTextNode;
            if (textNode != null && !string.IsNullOrEmpty(textNode.Text)) {
                // a null separator array splits on any whitespace
                // (spaces, tabs, newlines), not just single spaces
                string[] singleWords = textNode.Text.Split(
                    (char[])null,
                    StringSplitOptions.RemoveEmptyEntries
                );
                words.AddRange(
                    singleWords
                        .Select(w => new HtmlWord(w, node.ParentNode, top)
                    )
                );
            }
        }

        return words.AsReadOnly();
    }
}



Conclusion

Just to reiterate: this is a thrown-together solution; I'm sure it has problems. I present it only as a starting point for you to consider -- again, if you're unable to get the behavior you want through other means.
