How to get all displayed text from a web page in C#


Question

Hi, I am working on a data scraping application in C#.

I want to get all of the displayed text, but not the HTML tags.

Here is my code:

// Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc =
    web.Load(@"http://dawateislami.net/books/bookslibrary.do#!section:bookDetail_521.tr");
string str = doc.DocumentNode.InnerText;

InnerText also returns some tags and script content, but I only want the displayed text that is visible to the user. Please help me. Thanks.

Recommended answer

[I believe this will solve your problem][1]

Method 1 – In-Memory Cut and Paste

Use a WebBrowser control to load and render the web page, then copy the text out of the control.

Use the following code to download the web page:

//Create the WebBrowser control
WebBrowser wb = new WebBrowser();
//Add a new event to process document when download is completed   
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
//Download the webpage
wb.Url = new Uri(urlPath); // assumes urlPath is a string holding an absolute URL

Use the following event handler to process the downloaded web page text:

private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    // Select the whole rendered page and copy it to the clipboard
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    // CleanText is a helper that is not shown in this answer
    textResultsBox.Text = CleanText(Clipboard.GetText());
}
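
The handler above calls CleanText, which the answer never defines. As a rough sketch of what such a helper might do, the version below only normalizes whitespace in the copied text. The class name TextCleaner and the exact cleanup rules are assumptions made for illustration, not the original author's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical stand-in for the CleanText helper, which the answer does not show.
    public static string CleanText(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return string.Empty;
        }

        // Normalize Windows and old Mac line endings to '\n'
        string cleaned = text.Replace("\r\n", "\n").Replace('\r', '\n');
        // Collapse runs of spaces and tabs into a single space
        cleaned = Regex.Replace(cleaned, @"[ \t]+", " ");
        // Drop the single space left before or after a line break
        cleaned = Regex.Replace(cleaned, @" ?\n ?", "\n");
        // Reduce three or more consecutive line breaks to one blank line
        cleaned = Regex.Replace(cleaned, @"\n{3,}", "\n\n");

        return cleaned.Trim();
    }
}
```

With this sketch, the handler would call TextCleaner.CleanText(Clipboard.GetText()). Clipboard text copied from a rendered page often carries long runs of blank lines where images or layout tables were, which is the case this cleanup targets.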

Method 2 – In-Memory Selection Object

This is a second way to process the downloaded web page text. It seems to take slightly longer (the difference is minimal), but it avoids the clipboard and the limitations that come with it.

// Requires a reference to Microsoft.mshtml (using mshtml;)
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Get the WebBrowser control and its underlying IHTMLDocument2
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument = wb.Document.DomDocument as IHTMLDocument2;
    // Select all the text on the page and get the selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    // Create a text range and send the range's text to the text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}

Method 3 – The Elegant, Simple, Slower XmlDocument Approach

A good friend shared this example with me. I am a huge fan of simplicity, and this example wins the simplicity contest hands down. Unfortunately, it was very slow compared to the other two approaches.

The XmlDocument object will load and process HTML files with only 3 simple lines of code:

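// Note: XmlDocument only parses well-formed XML, so this works for XHTML pages
// but throws an XmlException on typical non-XML HTML.
XmlDocument document = new XmlDocument();
document.Load("http://www.yourwebsite.com"); // placeholder URL; without a scheme the string is treated as a file path
string allText = document.InnerText;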
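
One caveat with this approach is that InnerText concatenates every text node, including the contents of script and style elements, which is the problem the question describes. A minimal sketch of one way around that, assuming the page is well-formed XHTML, is to remove those elements before reading InnerText. The class and method names (XhtmlText, GetVisibleText) are hypothetical, and "visible" here only means "not inside script or style"; elements hidden by CSS are not detected.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlText
{
    // Hypothetical helper: returns the text of a well-formed XHTML string,
    // skipping anything inside <script> and <style> elements.
    public static string GetVisibleText(string xhtml)
    {
        XmlDocument document = new XmlDocument();
        document.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        // local-name() matches the elements whether or not the XHTML namespace is declared
        XmlNodeList matches = document.SelectNodes(
            "//*[local-name()='script' or local-name()='style']");

        // Copy the matches first so nodes are not removed while the list is being enumerated
        List<XmlNode> toRemove = new List<XmlNode>();
        foreach (XmlNode node in matches)
        {
            toRemove.Add(node);
        }

        foreach (XmlNode node in toRemove)
        {
            if (node.ParentNode != null)
            {
                node.ParentNode.RemoveChild(node);
            }
        }

        return document.InnerText;
    }
}
```

This keeps to System.Xml only, in line with the "no external packages" goal of the answer, but it inherits the same limitation: most real-world HTML is not well-formed XML and will fail to load.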

There you have it! Three simple ways to scrape only the displayed text from web pages, with no external "packages" involved.
