Html Agility Pack: error parsing and returning an XElement


Question

I can parse the document and generate output, but the output cannot be parsed into an XElement because of a p tag. Everything else in the string is parsed correctly.

My input:

var input = "<p> Not sure why is is null for some wierd reason!<br><br>I have implemented the auto save feature, but does it really work after 100s?<br></p> <p> <i>Autosave?? </i> </p> <p>we are talking...</p><p></p><hr><p><br class=\"GENTICS_ephemera\"></p>";

My code:

public static XElement CleanupHtml(string input)
{
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

    htmlDoc.OptionOutputAsXml = true;
    //htmlDoc.OptionWriteEmptyNodes = true;
    //htmlDoc.OptionAutoCloseOnEnd = true;
    htmlDoc.OptionFixNestedTags = true;

    htmlDoc.LoadHtml(input);

    // ParseErrors is an ArrayList containing any errors from the Load statement
    if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
    {
    }
    else
    {
        if (htmlDoc.DocumentNode != null)
        {
            var ndoc = new HtmlDocument(); // HTML doc instance
            HtmlNode p = ndoc.CreateElement("body");

            p.InnerHtml = htmlDoc.DocumentNode.InnerHtml;
            var result = p.OuterHtml.Replace("<br>", "<br/>");
            result = result.Replace("<br class=\"special_class\">", "<br/>");
            result = result.Replace("<hr>", "<hr/>");
            return XElement.Parse(result, LoadOptions.PreserveWhitespace);
        }
    }
    return new XElement("body");
}

My output:

<body>
   <p> Not sure why is is null for some wierd reason chappy!
   <br/>
   <br/>I have implemented the auto save feature, but does it really work after 100s?
   <br/>
   </p> 
   <p> 
   <i>Autosave?? </i> 
   </p> 
   <p>we are talking...</p>
   **<p>**
   <hr/>
   <p>
   <br/>
   </p>
</body>

The bold p tag is the one that was not output correctly. Is there a way around this? Am I doing something wrong in the code?

Recommended answer

What you are trying to do is basically transform an HTML input into an XML output.

Html Agility Pack can do that when you set the OptionOutputAsXml option. In that case, though, you should not use the InnerHtml property; instead, let the Html Agility Pack do the groundwork for you with one of HtmlDocument's Save methods.

Here is a generic function that converts HTML text to an XElement instance:

public static XElement HtmlToXElement(string html)
{
    if (html == null)
        throw new ArgumentNullException("html");

    HtmlDocument doc = new HtmlDocument();
    doc.OptionOutputAsXml = true;
    doc.LoadHtml(html);
    using (StringWriter writer = new StringWriter())
    {
        doc.Save(writer);
        using (StringReader reader = new StringReader(writer.ToString()))
        {
            return XElement.Load(reader);
        }
    }
}

As you can see, you don't have to do much work yourself. Note that since your original input text has no root element, the Html Agility Pack automatically adds an enclosing SPAN element to ensure the output is valid XML.
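If the caller expects a body root, as in the question, rather than the enclosing span, the returned element can be renamed with LINQ to XML. The following is only a sketch: SpanRootExample and RenameWrapper are hypothetical names, not part of the Html Agility Pack, and the helper assumes that a root named span is the automatically added wrapper. An input whose real root is a single span would be renamed too.

```csharp
using System;
using System.Xml.Linq;

public static class SpanRootExample
{
    // Hypothetical helper: if the root element is named "span"
    // (assumed here to be the wrapper added for root-less input),
    // rename it in place; otherwise leave the element untouched.
    public static XElement RenameWrapper(XElement root, string newName)
    {
        if (root == null)
            throw new ArgumentNullException(nameof(root));

        if (string.Equals(root.Name.LocalName, "span", StringComparison.OrdinalIgnoreCase))
        {
            // Keep the element's namespace (usually none) and change only the local name.
            root.Name = root.Name.Namespace + newName;
        }
        return root;
    }
}
```

With the function above, this could be called as SpanRootExample.RenameWrapper(HtmlToXElement(html), "body").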

In your case, you additionally want to process some tags, so here is how to do it with your example:

    public static XElement CleanupHtml(string input)
    {
        if (input == null)
            throw new ArgumentNullException("input");

        HtmlDocument doc = new HtmlDocument();
        doc.OptionOutputAsXml = true;
        doc.LoadHtml(input);

        // extra processing, remove some attributes using DOM
        HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//br[@class='special_class']");
        if (coll != null)
        {
            foreach (HtmlNode node in coll)
            {
                node.Attributes.Remove("class");
            }
        }

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            using (StringReader reader = new StringReader(writer.ToString()))
            {
                return XElement.Load(reader);
            }
        }
    }

As you can see, you should not use raw string functions; use the Html Agility Pack DOM functions instead (SelectNodes, Add, Remove, etc.).
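To illustrate why raw string replacement is fragile, here is a minimal sketch that uses only System.Xml.Linq, with no Html Agility Pack involved. StringPatchExample, IsWellFormed and PatchBr are hypothetical names introduced for this illustration. PatchBr mirrors the Replace("<br>", "<br/>") call from the question; it fixes a bare br tag but misses one that carries an attribute, so the result is still not well-formed XML.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class StringPatchExample
{
    // Returns true when the markup parses as well-formed XML.
    public static bool IsWellFormed(string markup)
    {
        try
        {
            XElement.Parse(markup);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // The string-replacement approach from the question, reduced to one tag.
    // It only matches the exact text "<br>".
    public static string PatchBr(string html)
    {
        return html.Replace("<br>", "<br/>");
    }
}
```

Here PatchBr turns "<body><p>a<br>b</p></body>" into well-formed XML, while "<body><p><br class=\"GENTICS_ephemera\"></p></body>" passes through unchanged and IsWellFormed still rejects it. A DOM-based approach avoids having to anticipate every textual variant of a tag.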
