HtmlAgilityPack - remove script and style?

Question

I'm using the following method to extract text from HTML:

    public string getAllText(string _html)
    {
        string _allText = "";
        try
        {
            HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
            document.LoadHtml(_html);

            var root = document.DocumentNode;
            var sb = new StringBuilder();
            foreach (var node in root.DescendantNodesAndSelf())
            {
                if (!node.HasChildNodes)
                {
                    string text = node.InnerText;
                    if (!string.IsNullOrEmpty(text))
                        sb.AppendLine(text.Trim());
                }
            }

            _allText = sb.ToString();

        }
        catch (Exception)
        {
        }

        _allText = System.Web.HttpUtility.HtmlDecode(_allText);

        return _allText;
    }

The problem is that the output also includes the contents of script and style tags.

How can I exclude them?

Solution

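    // Requires: using System.Linq;
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);

    // ToList() takes a snapshot of the matches first, so nodes are not
    // removed while the document tree is still being enumerated.
    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style")
        .ToList()
        .ForEach(n => n.Remove());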
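
For context, here is a minimal sketch of how the removal step might be combined with the text-extraction loop from the question. The class and method names (`HtmlTextExtractor`, `GetVisibleText`) are hypothetical and not part of HtmlAgilityPack. `System.Net.WebUtility.HtmlDecode` is used in place of `System.Web.HttpUtility.HtmlDecode`, on the assumption that a `System.Web` reference is not wanted. The empty `catch` block from the question is left out so that failures are not silently swallowed.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class HtmlTextExtractor
{
    // Hypothetical helper (not an HtmlAgilityPack API): returns the text of
    // all leaf nodes, skipping anything inside <script> or <style>.
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // Snapshot the matching elements before removing them, so the tree
        // is not modified while Descendants() is being enumerated.
        var unwanted = doc.DocumentNode.Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();
        foreach (var node in unwanted)
        {
            node.Remove();
        }

        // Same leaf-node walk as in the question.
        var sb = new StringBuilder();
        foreach (var node in doc.DocumentNode.DescendantNodesAndSelf())
        {
            if (node.HasChildNodes)
            {
                continue;
            }

            string text = node.InnerText;
            if (!string.IsNullOrWhiteSpace(text))
            {
                sb.AppendLine(text.Trim());
            }
        }

        // Decode entities such as &amp; after the text has been collected.
        return WebUtility.HtmlDecode(sb.ToString());
    }
}
```

Calling `HtmlTextExtractor.GetVisibleText(html)` on a page that has a `<style>` block in the head and a `<script>` block in the body should then return only the text of the remaining elements. Comment nodes are not filtered out in this sketch; if they show up in the output, they could be excluded with an additional check on `node.NodeType`.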
