Parsing HTML to get content using C#

Question

I am writing an application that crawls a group of my web pages. Rather than take the entire source code of the page I'd like to take all of the content and store that and be able to store the page as plain text within a database. The content will be used in other applications and not read by users so there's no need for it to be perfectly human-readable.

At first, I was thinking of using regular expressions, but I have no control over the validity of the web pages and there is a great chance that no regular expression would give me the content.

If I have the source code within a string, how can I turn that string of source code into just the content in C#?

Recommended answer

It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:

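using System.Net;        // WebClient
using System.Text;       // StringBuilder
using HtmlAgilityPack;   // HtmlDocument, HtmlTextNode

string html;
// obtain some arbitrary html....
using (var client = new WebClient()) {
    html = client.DownloadString("http://stackoverflow.com/questions/2038104");
}
// use the html agility pack: http://www.codeplex.com/htmlagilitypack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
// SelectNodes returns null (not an empty collection) when nothing matches
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null) {
    foreach (HtmlTextNode node in textNodes) {
        sb.AppendLine(node.Text);
    }
}
string final = sb.ToString();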
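
Since the text is going into a database for other applications, it may help to drop script and style bodies, decode entities, and collapse whitespace before storing it. The following is only a sketch under those assumptions: `PlainText` and `FromHtml` are hypothetical names, not part of the answer above or of the Html Agility Pack, and the filter only looks at a text node's direct parent.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class PlainText
{
    // Hypothetical helper: returns the visible text of an HTML string
    // as a single whitespace-normalized line.
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var pieces = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Skip text that sits directly inside <script> or <style>.
            .Where(n => n.ParentNode == null ||
                        (n.ParentNode.Name != "script" && n.ParentNode.Name != "style"))
            // Turn entities such as &amp; back into plain characters.
            .Select(n => WebUtility.HtmlDecode(n.InnerText))
            .Where(t => !string.IsNullOrWhiteSpace(t));

        // Join the pieces, then collapse every run of whitespace to one space.
        return Regex.Replace(string.Join(" ", pieces), @"\s+", " ").Trim();
    }
}
```

With the `html` string downloaded above, `string content = PlainText.FromHtml(html);` would give a value ready to store. Joining with a single space throws away line structure, which seems acceptable here because the question says the text does not need to be human-readable.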
