How can I extract just the text from HTML?


Question

I need to extract all the text that is present in the <body> of an HTML document. Sample HTML input:

<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The output should be:

This is a big title. How are doing you? I am fine

I want to use only Html Agility Pack for this purpose. No regular expressions, please.

I know how to load an HtmlDocument and then use an XPath query such as '//body' to get the body contents. But how do I strip the HTML so that only the text shown in the output above remains?

Thanks in advance :)

Recommended answer

You can use the InnerText of the body:

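using System.Text.RegularExpressions; // needed only for the Regex whitespace step
using HtmlAgilityPack;

string html = @"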
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

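// Parse the markup, then read the concatenated text of everything under <body>.
// SelectSingleNode returns null when no node matches, so check for null on real input.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;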

Next, you may want to collapse spaces and newlines:

text = Regex.Replace(text, @"\s+", " ").Trim();
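
Since the question asks for no regular expressions, the same collapsing step can be sketched with plain string operations. This is only an illustration: the class name TextUtil and the method name CollapseWhitespace are made up here, and for unusual whitespace characters the result may differ slightly from what \s matches.

```csharp
using System;

public static class TextUtil
{
    // Collapses every run of whitespace (spaces, tabs, newlines) into a single
    // space and drops leading/trailing whitespace, without regular expressions.
    public static string CollapseWhitespace(string text)
    {
        // A null separator array tells Split to break on any whitespace character;
        // RemoveEmptyEntries discards the empty strings between consecutive separators.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

With this helper, the regex line can be replaced by text = TextUtil.CollapseWhitespace(text);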

Note, however, that although this works for the sample above, markup such as hello<br>world or hello<i>world</i> is converted by InnerText to helloworld: the tags are removed and nothing is inserted in their place. That problem is hard to solve in general, because how text is displayed is often determined by CSS, not just by the markup.
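
One rough workaround for the helloworld problem is to collect the text nodes under <body> individually and join them with a space. The sketch below assumes the HtmlAgilityPack package is referenced; the class name BodyText and the method name Extract are hypothetical, and skipping <script> and <style> content is an extra assumption the original answer does not mention. It is a heuristic rather than a real layout-aware solution: it also puts a space at inline boundaries, so un<b>believable</b> would come out as "un believable".

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class BodyText
{
    // Returns the text under <body>, with a single space between adjacent text nodes.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the document has no <body>.
        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null) return string.Empty;

        var parts = new List<string>();
        foreach (HtmlNode node in body.Descendants())
        {
            // Only text nodes carry visible text; elements and comments are skipped.
            if (node.NodeType != HtmlNodeType.Text) continue;

            // Assumption: script and style content should not count as page text.
            string parent = node.ParentNode?.Name;
            if (parent == "script" || parent == "style") continue;

            // Decode entities such as &amp; and trim the whitespace around each piece.
            string piece = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (piece.Length > 0) parts.Add(piece);
        }

        // Whitespace inside a single text node is left as-is; collapse it
        // afterwards if single-spaced output is needed.
        return string.Join(" ", parts);
    }
}
```

Calling BodyText.Extract("<html><body>hello<br>world</body></html>") gives "hello world" instead of "helloworld", because hello and world are separate text nodes.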
