How do I extract info from a webpage?

Views: 151 Posted: 2016/9/21 14:37:17 Tags: c# html regex


Problem description

I want to collect some data from the front page of a website. I can easily run through each line, but I am only interested in one specific line. So I want to identify the correct line and extract the number, in this case 324. How can I do this?

<h2><a href="/mmp/it/su/">Weather</a></h2> <span class="jix_channels_count">(324)</span><br><p class="jix_channels_desc">Prog&oslash;r, su, si&oslash;r, tester</p>

Solution

After downloading the contents, use an HTML Parser such as HTML Agility Pack to identify the span element belonging to the jix_channels_count class.
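
As a rough illustration of the HTML Agility Pack route, the sketch below loads markup from a string and selects the span by its class with XPath. It assumes the HtmlAgilityPack NuGet package is referenced; the names `HapExample` and `ExtractCount` are hypothetical, and the sample markup is the snippet from the question rather than a downloaded page.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HapExample
{
    // Hypothetical helper: returns the number inside the first
    // <span class="jix_channels_count">, or null if no such span is found.
    public static int? ExtractCount(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath lookup by exact class attribute value.
        var span = doc.DocumentNode.SelectSingleNode("//span[@class='jix_channels_count']");
        if (span == null) return null;

        // The span text looks like "(324)"; strip the parentheses before parsing.
        var text = span.InnerText.Trim().Trim('(', ')');
        int n;
        return int.TryParse(text, out n) ? n : (int?)null;
    }

    public static void Main()
    {
        const string html =
            "<h2><a href=\"/mmp/it/su/\">Weather</a></h2> " +
            "<span class=\"jix_channels_count\">(324)</span><br>";
        Console.WriteLine(ExtractCount(html)); // prints 324
    }
}
```

The XPath matches the class attribute exactly, so a span carrying several classes would need a `contains(...)` test instead.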

Another option is SgmlReader.

You tagged your question with regex - I wholeheartedly advise you not to take this direction.

The suggested approach (with SgmlReader) goes more or less like so:
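using System.IO;
using System.Net;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using Sgml; // SgmlReader

// WebRequest.Create needs an absolute URI, so the scheme is included.
var url = "http://www.that-website.com/foo/";
var myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
var responseStream = myResponse.GetResponseStream();
// Encoding.Default is platform-dependent; use the page's actual encoding if it differs.
var sr = new StreamReader(responseStream, Encoding.Default);
// SgmlReader exposes the (possibly malformed) HTML as well-formed XML.
var reader = new SgmlReader
             {
                 DocType = "HTML",
                 WhitespaceHandling = WhitespaceHandling.None,
                 CaseFolding = CaseFolding.ToLower,
                 InputStream = sr
             };
var xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
var nodeReader = new XmlNodeReader(xmlDoc);
XElement xml = XElement.Load(nodeReader);
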
Now you can just use LINQ to XML to (recursively or otherwise) find the span element with an attribute class whose value equals jix_channels_count and read the value of that element.
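
The LINQ to XML step might look roughly like the sketch below. It is not part of the original answer: the names `ChannelCount` and `Extract` are hypothetical, the `XElement` is parsed from a small well-formed string as a stand-in for the one produced by SgmlReader, and it assumes the elements carry no XML namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ChannelCount
{
    // Hypothetical helper: returns the number inside the first
    // <span class="jix_channels_count">, or null if no such span is found.
    public static int? Extract(XElement root)
    {
        var span = root
            .DescendantsAndSelf("span")
            .FirstOrDefault(e => (string)e.Attribute("class") == "jix_channels_count");
        if (span == null) return null;

        // The element text looks like "(324)"; strip the parentheses before parsing.
        var text = span.Value.Trim().Trim('(', ')');
        int n;
        return int.TryParse(text, out n) ? n : (int?)null;
    }

    public static void Main()
    {
        // Stand-in for the XElement loaded through SgmlReader.
        var xml = XElement.Parse(
            "<html><body><h2><a href=\"/mmp/it/su/\">Weather</a></h2>" +
            "<span class=\"jix_channels_count\">(324)</span></body></html>");
        Console.WriteLine(Extract(xml)); // prints 324
    }
}
```

`DescendantsAndSelf` walks the whole tree, so no hand-written recursion is needed, and the explicit `(string)` cast on the attribute yields null rather than throwing when a span has no class.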
