How to get all display text from a webpage in C#
Question
Hi, I am working on a data scraping application in C#.
I want to get all of the display text, but not the HTML tags.
Here is my code:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc =
    web.Load(@"http://dawateislami.net/books/bookslibrary.do#!section:bookDetail_521.tr");
string str = doc.DocumentNode.InnerText;
InnerText is returning script content and some tags as well, but I only want the display text that is visible to the user. Please help me. Thanks.
Recommended answer
[I believe this will solve your problem][1]
Method 1 – In Memory Cut and Paste
Use a WebBrowser control to load the web page, and then copy the text from the control.
Use the following code to download the web page:
// Create the WebBrowser control
WebBrowser wb = new WebBrowser();
// Process the document once the download has completed
wb.DocumentCompleted +=
    new WebBrowserDocumentCompletedEventHandler(DisplayText);
// Download the web page (Navigate accepts a URL string or a Uri)
wb.Navigate(urlPath);
Use the following event handler to process the downloaded web page text:
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    WebBrowser wb = (WebBrowser)sender;
    // Select the whole page and copy it to the clipboard as text
    wb.Document.ExecCommand("SelectAll", false, null);
    wb.Document.ExecCommand("Copy", false, null);
    textResultsBox.Text = CleanText(Clipboard.GetText());
}
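The answer calls CleanText but never shows it, so what it actually does is unknown. As a minimal sketch, assuming it only tidies the whitespace that a clipboard copy tends to leave behind, a stand-in could look like this. The class name TextCleaner and the cleanup rules are assumptions for illustration, not the original author's helper.

```csharp
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // Hypothetical stand-in for the CleanText helper used in Method 1.
    // Assumption: it only normalizes whitespace in the copied text.
    public static string CleanText(string rawText)
    {
        if (string.IsNullOrEmpty(rawText))
            return string.Empty;

        // Normalize Windows and old Mac line endings to "\n"
        string text = rawText.Replace("\r\n", "\n").Replace('\r', '\n');

        // Collapse runs of spaces and tabs into a single space
        text = Regex.Replace(text, @"[ \t]+", " ");

        // Drop spaces around line breaks and collapse blank lines
        text = Regex.Replace(text, @" ?\n[ \n]*", "\n");

        return text.Trim();
    }
}
```

With this version, text such as "Hello   world" followed by blank lines and an indented "Next line" comes back as two tidy lines separated by a single newline.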
Method 2 – In Memory Selection Object
This is a second way of processing the downloaded web page text. It seems to take slightly longer (the difference is minimal), but it avoids the clipboard and the limitations that come with it.
// Requires a reference to Microsoft.mshtml and: using mshtml;
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Get the WebBrowser control and its underlying IHTMLDocument2
    WebBrowser wb = (WebBrowser)sender;
    IHTMLDocument2 htmlDocument =
        wb.Document.DomDocument as IHTMLDocument2;
    // Select all the text on the page and get the selection object
    wb.Document.ExecCommand("SelectAll", false, null);
    IHTMLSelectionObject currentSelection = htmlDocument.selection;
    // Create a text range and send the range's text to your text box
    IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
    textResultsBox.Text = range.text;
}
Method 3 – The Elegant, Simple, Slower XmlDocument Approach
A good friend shared this example with me. I am a huge fan of simple, and this example wins the simplicity contest hands down. Unfortunately, it was very slow compared to the other two approaches.
The XmlDocument object will load and process HTML files with only 3 simple lines of code:
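// Requires: using System.Xml;
// Note: XmlDocument only parses well-formed XML, so the page must be valid XHTML.
XmlDocument document = new XmlDocument();
document.Load("http://www.yourwebsite.com");
string allText = document.InnerText;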
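One caveat with this method is that XmlDocument throws an XmlException on markup that is not well-formed XML, which is common in real-world HTML (an unclosed br tag is enough). The sketch below shows one way to guard against that. The class and method names are made up for illustration, and it parses a string with LoadXml instead of downloading a URL so that it runs on its own.

```csharp
using System.Xml;

public static class XmlTextExtractor
{
    // Illustrative helper (not from the original answer): returns the
    // concatenated text of the markup, or null if it is not well-formed XML.
    public static string TryGetInnerText(string markup)
    {
        var document = new XmlDocument();
        try
        {
            // LoadXml parses a string; Load would read from a file path or URL
            document.LoadXml(markup);
        }
        catch (XmlException)
        {
            // Typical for ordinary HTML: unclosed tags, undefined entities, etc.
            return null;
        }

        // InnerText concatenates every text node without adding separators
        return document.InnerText;
    }
}
```

Note that InnerText joins text nodes with no whitespace between them, so two adjacent paragraphs containing "Hello" and "World" come back as "HelloWorld".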
There you have it! Three simple ways to scrape only the displayed text from web pages, with no external packages involved.
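Since the question already uses HtmlAgilityPack, it may also be worth trying to strip the script and style nodes before reading InnerText, because those are where the unwanted content comes from. This is only a sketch under that assumption: the class and method names are invented here, it requires the HtmlAgilityPack NuGet package, and it does not detect elements hidden by CSS or content that the page builds with JavaScript after loading.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class VisibleTextExtractor
{
    // Illustrative helper: removes non-display nodes, then returns the text.
    public static string GetVisibleText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches
        var hidden = doc.DocumentNode.SelectNodes("//script|//style|//noscript");
        if (hidden != null)
        {
            // Copy to a list first so removal cannot disturb the enumeration
            foreach (var node in hidden.ToList())
                node.Remove();
        }

        // Decode entities such as &amp; into the characters a user would see
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

To use it with the code from the question, pass the downloaded markup to the helper, for example GetVisibleText(doc.DocumentNode.OuterHtml).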