How can I get all the HTML and CSS content of a web page using a C#/.NET desktop application?


Question

I want to get all the content of a web page. My goal is to generate a report from that content.

Suppose the web page contains a table. I want to get all of its HTML and CSS content and then put it into Excel. I have already done this with the WebBrowser control in C# (.NET), and it works when the table data is constant. The problem is that WebBrowser does not support all CSS and jQuery functions, and my table data is not constant.

Is there any other mechanism? How can I get all the content of a web page, or save it all and then read it back? I then want to create an Excel file and put all that data into it, along with the CSS I got from the web page.

This is my sample code. Please suggest some other way.

string input1 = "http://localhost:62343/login.html";
webBrowser1.Navigate(input1);







Sample Table:

<div style="removed: absolute; removed 0px;" class="tab-1 st_view st_view_first st_view_active">
  <div style="removed: relative; height:400px;" class="tabcontent">
    <table report_type="Horizontal Table" auto-aggregation="off" id="HorTable0" sort="asc" relativeid="HorTable1" top="50" bottom="" left="50" right="">
      <thead>
        <tr>
          <td>Node Name</td><td>Hit</td><td>Duration</td>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td att="@VAL([NodeLogQuery1].[NodeName])" data_type="undefined"></td>
          <td att="@VAL([NodeLogQuery1].[Hit])" data_type="undefined"></td>
          <td att="@VAL([NodeLogQuery1].[Duration])" data_type="undefined"></td>
        </tr>
      </tbody>
    </table>
    <table report_type="Horizontal Table" auto-aggregation="off" id="HorTable1" sort="asc" relativeid="HorTable0" top="" bottom="10" left="" right="50">
      <thead>
        <tr>
          <td>Node Name</td><td>Hit</td><td>Duration</td>
        </tr>
      </thead>
      <tbody>
        <tr>
          <td att="@VAL([NodeLogQuery1].[NodeName])" data_type="undefined"></td>
          <td att="@VAL([NodeLogQuery1].[Hit])" data_type="undefined"></td>
          <td att="@VAL([NodeLogQuery1].[Duration])" data_type="undefined"></td>
        </tr>
      </tbody>
    </table>
  </div>
</div>

Recommended answer

You can indeed, as Afzaal shows you here, use WebClient to get the "raw text" of the page currently rendered in the browser.

However, that "raw text" will probably incorporate CSS through links to external files, so you will have to parse the "raw text" to discover those files, extract the file paths, and then use WebClient to get their contents, provided the contents are not compressed. I say "probably" because it is rare these days to see inline style definitions in a base page.
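The "discover those files" step could be sketched roughly as below. This is only an illustration, not code from the CodeProject article: the StylesheetLinks class and its Extract method are hypothetical names, it only looks at <link rel="stylesheet"> tags (not @import rules or inline styles), and a regular expression is a fragile way to read HTML, so a real HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (an assumption for illustration): finds
// <link rel="stylesheet"> tags in raw HTML and resolves each href
// against the URL of the page the HTML came from.
public static class StylesheetLinks
{
    // Matches a whole <link ...> tag.
    private static readonly Regex LinkTag =
        new Regex(@"<link\b[^>]*>", RegexOptions.IgnoreCase);

    // Matches rel="..." when the value contains the word "stylesheet".
    private static readonly Regex RelStylesheet =
        new Regex(@"\brel\s*=\s*[""'][^""']*\bstylesheet\b[^""']*[""']",
                  RegexOptions.IgnoreCase);

    // Captures the value of href="..." or href='...'.
    private static readonly Regex HrefAttr =
        new Regex(@"\bhref\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase);

    public static List<Uri> Extract(string html, Uri pageUri)
    {
        var result = new List<Uri>();
        foreach (Match tag in LinkTag.Matches(html))
        {
            // Skip <link> tags that are not stylesheets (icons, feeds, ...).
            if (!RelStylesheet.IsMatch(tag.Value)) continue;

            Match href = HrefAttr.Match(tag.Value);
            if (!href.Success) continue;

            // Resolve relative paths such as "css/site.css" to absolute URLs.
            result.Add(new Uri(pageUri, href.Groups[1].Value));
        }
        return result;
    }
}
```

Each returned Uri could then be passed to WebClient.DownloadString to fetch the stylesheet text, subject to the same caveat about compressed content.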

Even if you saved a web page as an MHTML archive (assuming your browser supports it), you would still have to find a way to get the linked-to files for CSS and whatever other file types you are interested in.

Fortunately for you, there is a 2005 CodeProject article that will grab the CSS as well as the HTML and wrap them in an MHTML archive, also solving a certain security problem you may have. It does not use WebClient to get linked-to files, so it can handle encrypted linked-to files.

It has been a few years since I tried the code in this article, and browsers certainly change. As with any CodeProject article, I suggest you read the user comments for recent success, failure, or problem reports about the code, then test it to see if it meets your current needs.


You need a WebClient, not a WebBrowser, to download the HTML and all other content of the document. WebBrowser is meant for displaying web pages in your application rather than for handing you their data as a string (or other data type), whereas WebClient can be used to download resources. In this case the string will contain the HTML markup. If that markup is also well-formed XML, you can process it with any XmlReader, such as the one built into .NET; ordinary HTML often is not well-formed XML, in which case an XmlReader will reject it.

For example:

// required
using System.Net;

// create an instance
WebClient webClient = new WebClient();

// call the HTML page you want to download, and get it as a string
string htmlCode = webClient.DownloadString("{web page (or resource) you want to download}");
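If the downloaded markup happens to be well-formed XML, the XmlReader idea mentioned above might look something like the sketch below. The TableCells class and ReadCells method are made-up names for illustration, and the code assumes every <td> holds plain text only; XmlReader throws an XmlException on markup that is not well-formed, which is common for real-world HTML.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper (an assumption for illustration): collects the text
// of every <td> element from markup that is well-formed XML.
public static class TableCells
{
    public static List<string> ReadCells(string markup)
    {
        var cells = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(markup)))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "td")
                {
                    // Reads the cell text and moves the reader past </td>,
                    // so Read() must not be called again for this node.
                    cells.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return cells;
    }
}
```

Called on the header row of the sample table, ReadCells would give back the three column names, which is the kind of data you could then write into a spreadsheet.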





You should then release the resources used by calling webClient.Dispose();. The MSDN documentation says that DownloadString downloads the requested resource as a string.
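One common way to make sure Dispose is called, even when the download throws, is a using statement. The sketch below wraps the same DownloadString call; the PageDownloader class and Download method are hypothetical names chosen for this example, not part of the original answer.

```csharp
using System;
using System.Net;

// Hypothetical wrapper (an assumption for illustration) around
// WebClient.DownloadString.
public static class PageDownloader
{
    // Downloads the resource at `address` as a string. The using statement
    // disposes the WebClient when the block exits, including on exceptions.
    public static string Download(string address)
    {
        using (var webClient = new WebClient())
        {
            return webClient.DownloadString(address);
        }
    }
}
```

With the URL from the question, the call would be PageDownloader.Download("http://localhost:62343/login.html"). Note that newer .NET versions mark WebClient as obsolete in favour of HttpClient, so this pattern fits best on the .NET Framework the question appears to target.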

