How can I extract just the text from HTML?
Question
I have a requirement to extract all the text that is present in the <body> of the HTML. Sample HTML input:
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src="abc.jpg"/>
</body>
</html>
The output should be:
This is a big title. How are doing you? I am fine
I want to use only Html Agility Pack for this purpose. No regular expressions, please.
I know how to load an HtmlDocument and then use an XPath query like '//body' to get the body contents. But how do I strip the HTML, as shown in the output above?
Thanks in advance :)
Recommended answer
You can use the InnerText of the body:
推荐答案
您可以使用主体的InnerText
:
string html = @"
<html>
<title>title</title>
<body>
<h1> This is a big title.</h1>
How are doing you?
<h3> I am fine </h3>
<img src=""abc.jpg""/>
</body>
</html>";
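// Requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// InnerText concatenates the text under <body> and drops the tags.
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;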
Next, you may want to collapse spaces and new lines:
// Requires: using System.Text.RegularExpressions;
text = Regex.Replace(text, @"\s+", " ").Trim();
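Since the question asks to avoid regular expressions, here is a minimal sketch of the same whitespace collapsing done with string.Split instead. The class and method names (WhitespaceHelper, CollapseWhitespace) are made up for illustration. Passing a null separator array makes Split treat any whitespace character as a delimiter, which is roughly, though not exactly, what \s+ matches.

```csharp
using System;

public static class WhitespaceHelper
{
    // Hypothetical helper: collapses runs of whitespace into single spaces
    // and trims both ends, without using Regex.
    public static string CollapseWhitespace(string text)
    {
        // A null separator array means "split on whitespace";
        // RemoveEmptyEntries drops the empty strings between adjacent delimiters.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

With this in place, the last line could read text = WhitespaceHelper.CollapseWhitespace(text); instead of the Regex.Replace call.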
Note, however, that while this works in this case, markup such as hello<br>world or hello<i>world</i> will be converted by InnerText to helloworld, because the tags are simply removed. That issue is difficult to solve, as display is often determined by the CSS, not just by the markup.
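One possible workaround for the hello<br>world case, offered only as a rough sketch: collect the text nodes under the body individually and join them with a space, so that a tag boundary never glues two words together. The names TextExtractor and ExtractBodyText are made up for illustration, and skipping script and style content is my own assumption about what you would want. The trade-off is that inline markup such as hello<i>world</i> now comes out as "hello world" rather than "helloworld", which is also not what a browser shows, so this only moves the problem around.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class TextExtractor
{
    // Hypothetical helper: returns the visible-ish text of <body>,
    // with a space inserted at every tag boundary.
    public static string ExtractBodyText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
        if (body == null)
        {
            return string.Empty; // no <body> element in the input
        }

        var pieces = body.DescendantsAndSelf()
            // keep only text nodes, not elements or comments
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // assumption: text inside <script>/<style> is not wanted
            .Where(n => n.ParentNode.Name != "script" && n.ParentNode.Name != "style")
            // decode entities such as &amp; into plain characters
            .Select(n => HtmlEntity.DeEntitize(n.InnerText));

        // Join with spaces, then collapse whitespace without Regex.
        string joined = string.Join(" ", pieces);
        return string.Join(" ",
            joined.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

For the sample input in the question this should still produce the single line of text asked for, since every text node is separated by at least one space before the whitespace is collapsed.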