How can I extract just the text from HTML?

Views: 100 | Published: 2020/11/24 19:25:00 | Tags: c#, html-agility-pack

Question

I need to extract all the text present in the <body> of an HTML document. Sample HTML input:

<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The output should be:

This is a big title. How are doing you? I am fine

I want to use only Html Agility Pack for this. No regular expressions, please.

I know how to load an HtmlDocument and use an XPath query like '//body' to get the body contents. But how do I strip the HTML, as shown in the output above?

Thanks in advance :)

Recommended answer

You can use the InnerText of the body:

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

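// Requires the HtmlAgilityPack package and: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// InnerText returns the text of every node under <body>, with the tags stripped.
// SelectSingleNode returns null when the document has no <body>.
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;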

Next, you may want to collapse spaces and new lines:

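// Requires: using System.Text.RegularExpressions;
// Replace every run of white space (spaces, tabs, new lines) with a single space.
text = Regex.Replace(text, @"\s+", " ").Trim();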
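
Since the question asks to avoid regular expressions, the same collapsing can probably be done with plain string methods. The following is a minimal sketch and not part of the original answer. `WhitespaceHelper` and `CollapseWhitespace` are hypothetical names chosen for illustration, and the sample string only imitates what InnerText might return for the HTML above.

```csharp
using System;

static class WhitespaceHelper
{
    // Hypothetical helper: collapses every run of white space into one space
    // and trims both ends, without using Regex.
    public static string CollapseWhitespace(string text)
    {
        // A null separator array makes Split treat all white-space characters as delimiters;
        // RemoveEmptyEntries drops the empty strings produced by consecutive delimiters.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    static void Main()
    {
        // Assumed input, shaped like the InnerText of the sample <body>.
        string raw = "\n     This is a big title.\n     How are doing you?\n      I am fine \n";
        Console.WriteLine(CollapseWhitespace(raw));
        // prints "This is a big title. How are doing you? I am fine"
    }
}
```

This gives the same result as the Regex call for ordinary white space, at the cost of allocating an intermediate array of words.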

Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve, as display is often determined by the CSS, not just by the markup.
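
One possible workaround for the joined-words problem is to collect the text nodes one at a time and join them with a space. The sketch below is not from the original answer. `TextExtractor` and `ExtractText` are hypothetical names, and the approach assumes that a space between every pair of adjacent text nodes is acceptable. That assumption is wrong for inline markup such as hello<i>world</i>, where a browser would render no gap, and the sketch makes no attempt to skip script or style content.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class TextExtractor
{
    // Hypothetical helper: returns the body text with a single space between text nodes.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
        {
            return string.Empty; // no <body> element found
        }

        // Take each text node separately, split it on white space,
        // and flatten everything into one sequence of words.
        var words = body.Descendants()
            .Where(node => node.NodeType == HtmlNodeType.Text)
            .SelectMany(node => node.InnerText.Split(
                (char[])null, StringSplitOptions.RemoveEmptyEntries));

        return string.Join(" ", words);
    }

    static void Main()
    {
        Console.WriteLine(ExtractText("<html><body>hello<br>world</body></html>"));
    }
}
```

Because the words are rejoined with single spaces, this version also collapses white space without a regular expression. Deciding where a visual break really belongs would still require knowledge of the CSS, as the answer points out.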