How do I extract info from a webpage?


Question

I want to collect some data from the front page of a website. I can easily run through each line, but I am only interested in one specific line. I want to identify that line and extract the number from it, in this case 324. How can I do this?

<h2><a href="/mmp/it/su/">Weather</a></h2> <span class="jix_channels_count">(324)</span><br><p class="jix_channels_desc">Prog&oslash;r, su, si&oslash;r, tester</p>

Solution

After downloading the contents, use an HTML Parser such as HTML Agility Pack to identify the span element belonging to the jix_channels_count class.
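As a rough illustration of the HTML Agility Pack route, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is installed and that the page has already been downloaded into a string. The class name ChannelCount and the method name Extract are hypothetical, chosen only for this example, and the XPath assumes the span's class attribute is exactly jix_channels_count.

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

public static class ChannelCount
{
    // Returns the number inside the first <span class="jix_channels_count">,
    // or null if the span is missing or does not contain a number.
    public static int? Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: first span whose class attribute equals jix_channels_count.
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span[@class='jix_channels_count']");
        if (span == null) return null;

        // For the sample markup InnerText is "(324)"; strip the parentheses before parsing.
        string text = span.InnerText.Trim().Trim('(', ')');
        return int.TryParse(text, out int count) ? count : (int?)null;
    }

    public static void Main()
    {
        // The markup from the question, used here as a stand-in for the downloaded page.
        const string html =
            "<h2><a href=\"/mmp/it/su/\">Weather</a></h2> " +
            "<span class=\"jix_channels_count\">(324)</span><br>" +
            "<p class=\"jix_channels_desc\">Prog&oslash;r, su, si&oslash;r, tester</p>";

        Console.WriteLine(Extract(html));
    }
}
```

Returning a nullable int keeps the "span not found" case separate from a genuine count of zero, which matters if the page layout changes.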

Another option is SgmlReader.

You tagged your question with regex - I wholeheartedly advise you not to take this direction.

The suggested approach (with SgmlReader) goes more or less like so:

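using System.IO;
using System.Net;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using Sgml; // SgmlReader library

// WebRequest.Create needs an absolute URI, so the scheme must be included.
var url = "http://www.that-website.com/foo/";
var myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
var responseStream = myResponse.GetResponseStream();
// Encoding.Default is the system default; use the page's actual encoding if it is known.
var sr = new StreamReader(responseStream, Encoding.Default);
// SgmlReader reads the (possibly malformed) HTML and exposes it as XML.
var reader = new SgmlReader
             {
                 DocType = "HTML",
                 WhitespaceHandling = WhitespaceHandling.None,
                 CaseFolding = CaseFolding.ToLower,
                 InputStream = sr
             };
var xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
// Convert the XmlDocument to an XElement so it can be queried with LINQ to XML.
var nodeReader = new XmlNodeReader(xmlDoc);
XElement xml = XElement.Load(nodeReader);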

Now you can just use LINQ to XML to (recursively or otherwise) find the span element with an attribute class whose value equals jix_channels_count and read the value of that element.
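The LINQ to XML lookup described above could be sketched as follows. This is only an illustration: the class name SpanCount and the method name Extract are hypothetical, and the XML in Main is a hand-written, well-formed stand-in for whatever SgmlReader would actually produce from the real page.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SpanCount
{
    // Finds the first span element whose class attribute equals jix_channels_count
    // and returns the number it contains, or null if no such span or number exists.
    public static int? Extract(XElement root)
    {
        XElement span = root
            .DescendantsAndSelf()
            // Compare on LocalName so the lookup also works if the tree carries a namespace.
            .FirstOrDefault(e => e.Name.LocalName == "span"
                              && (string)e.Attribute("class") == "jix_channels_count");
        if (span == null) return null;

        // The element value is "(324)" in the sample; strip the parentheses before parsing.
        string text = span.Value.Trim().Trim('(', ')');
        return int.TryParse(text, out int count) ? count : (int?)null;
    }

    public static void Main()
    {
        // Hypothetical stand-in for the XElement built from the SgmlReader output.
        XElement xml = XElement.Parse(
            "<html><body>" +
            "<h2><a href=\"/mmp/it/su/\">Weather</a></h2>" +
            "<span class=\"jix_channels_count\">(324)</span><br />" +
            "<p class=\"jix_channels_desc\">Progør, su, siør, tester</p>" +
            "</body></html>");

        Console.WriteLine(Extract(xml));
    }
}
```

DescendantsAndSelf walks the whole tree, so no hand-written recursion is needed; the cast of a missing attribute to string yields null rather than throwing.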
