How do I extract info from a webpage?
I want to collect some data from the front page of a website. I can easily loop over each line, but only one specific line interests me, so I want to identify that line and extract the number from it, in this case 324. How can I do this?
<h2><a href="/mmp/it/su/">Weather</a></h2> <span class="jix_channels_count">(324)</span><br><p class="jix_channels_desc">Progør, su, siør, tester</p>
After downloading the content, use an HTML parser such as HTML Agility Pack to find the span element that belongs to the jix_channels_count class.
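As a rough illustration of the HTML Agility Pack route, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (AgilityPackExample, ExtractCountText) are made up for this example, and the HTML string is the snippet from the question rather than a live download.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class AgilityPackExample
{
    // Hypothetical helper: returns the text inside the first
    // <span class="jix_channels_count"> without its parentheses, or null if absent.
    public static string ExtractCountText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed tags such as <br>

        // XPath lookup; SelectSingleNode returns null when nothing matches.
        var span = doc.DocumentNode.SelectSingleNode("//span[@class='jix_channels_count']");
        return span?.InnerText.Trim().Trim('(', ')');
    }

    public static void Main()
    {
        var html = "<h2><a href=\"/mmp/it/su/\">Weather</a></h2> " +
                   "<span class=\"jix_channels_count\">(324)</span><br>";
        Console.WriteLine(ExtractCountText(html));
    }
}
```

The XPath here matches only when the class attribute is exactly jix_channels_count; if the real page puts several classes on the span, the predicate would need to be loosened.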
Another option is SgmlReader.
You tagged your question with regex. I wholeheartedly advise you not to take that direction.
The suggested approach (with SgmlReader) goes more or less like so:
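// Requires: using System.IO; using System.Net; using System.Text;
//           using System.Xml; using System.Xml.Linq; using Sgml;
// WebRequest.Create needs an absolute URI, so the scheme must be included.
var url = "http://www.that-website.com/foo/";
var myRequest = (HttpWebRequest)WebRequest.Create(url);
myRequest.Method = "GET";
WebResponse myResponse = myRequest.GetResponse();
var responseStream = myResponse.GetResponseStream();
var sr = new StreamReader(responseStream, Encoding.Default);
// SgmlReader exposes the (possibly malformed) HTML as well-formed XML.
var reader = new SgmlReader
{
DocType = "HTML",
WhitespaceHandling = WhitespaceHandling.None,
CaseFolding = CaseFolding.ToLower,
InputStream = sr
};
// Load the normalized markup, then hand it to LINQ to XML.
var xmlDoc = new XmlDocument();
xmlDoc.Load(reader);
var nodeReader = new XmlNodeReader(xmlDoc);
XElement xml = XElement.Load(nodeReader);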
Now you can use LINQ to XML to find (recursively or otherwise) the span element whose class attribute equals jix_channels_count and read the value of that element.
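The LINQ to XML lookup described above could look roughly like the sketch below. It is only an illustration: ChannelCountExample and ExtractCount are names invented here, and the XElement is parsed from a hand-written, well-formed version of the question's snippet, standing in for what SgmlReader would produce from the real page.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ChannelCountExample
{
    // Hypothetical helper: returns the number inside the first
    // <span class="jix_channels_count">, or null if no such span or number exists.
    public static int? ExtractCount(XElement root)
    {
        var span = root
            .DescendantsAndSelf("span")
            .FirstOrDefault(e => (string)e.Attribute("class") == "jix_channels_count");
        if (span == null) return null;

        // The element text looks like "(324)"; strip the parentheses before parsing.
        var text = span.Value.Trim().Trim('(', ')');
        return int.TryParse(text, out var count) ? count : (int?)null;
    }

    public static void Main()
    {
        // Stand-in for the SgmlReader output: the question's snippet, made well-formed.
        var xml = XElement.Parse(
            "<div><h2><a href=\"/mmp/it/su/\">Weather</a></h2>" +
            "<span class=\"jix_channels_count\">(324)</span><br /></div>");
        Console.WriteLine(ExtractCount(xml));
    }
}
```

DescendantsAndSelf walks the whole tree, so no hand-written recursion is needed. If the converted document puts its elements in an XML namespace, the "span" name would have to be qualified with that namespace.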