Get specific data from a web page

Question

Hi guys,
I have to collect data on thousands of companies from a website, and I need to write a program that does it automatically. I searched on Google but couldn't work out how to do it.

First, I open a web page that lists city names as links. Clicking a city name takes me to a new page that lists company names as links. Clicking a company name takes me to another page. I then have to search that page and copy the relevant information (company name, phone number, address, etc.) into an Excel worksheet. Can you help me with how to do this, please?

Recommended answer

You need to find the appropriate HTML tag for the information you want by parsing the HtmlElement, or the collection of HtmlElement objects, in the web document. Once you have found the right element, you can extract the individual fields of data, assuming they are all tagged in the way you expect. For more information on these techniques, try searching for "Screenscraper" on Google or in the CodeProject articles.
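
As a rough illustration of finding elements by parsing the document, here is a minimal sketch in Python using only the standard library's html.parser module, rather than the .NET HtmlElement class mentioned above. It collects the text and target of every link on a page, which is the step needed to get from the city list to the company list. The names LinkCollector and extract_links, and the example.com URL in the usage note, are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (text, absolute URL) pairs for every <a href="..."> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (text, url) pairs
        self._href = None    # URL of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, base_url):
    """Return all links found in an HTML string as (text, url) tuples."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links(page_html, "http://example.com/cities") on the city page would give the city links; calling it again on each city page would give the company links. On a real site you would probably need to filter the result, since navigation and footer links are collected too.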


This is called web scraping: http://en.wikipedia.org/wiki/Web_scraping

You need to use the class System.Net.HttpWebRequest, but to create an instance you will need a variable of the compile-time type System.Net.WebRequest; see:
http://msdn.microsoft.com/en-us/library/system.net.webrequest.aspx (includes an HttpWebRequest sample),
http://msdn.microsoft.com/en-us/library/system.net.httpwebrequest.aspx

Using HttpWebRequest, you will obtain a document. If it is an HTML document, you will need to parse it. If the document happens to be well-formed XML, you can parse it with one of the .NET XML parsers. Unfortunately, not all web pages are like that, so you may need an HTML parser that does not require well-formed XML. Try this one: http://www.majestic12.co.uk/projects/html_parser.php

—SA
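
The fetch, parse and export steps described in this answer might look roughly like the following. This is a sketch in Python's standard library, not the .NET HttpWebRequest API or the Majestic-12 parser named above. It assumes, purely for illustration, that each wanted field sits in an element with a recognizable class attribute (such as class="phone"); a real site will need its own rules. All function and class names here are hypothetical. Python's html.parser does not require well-formed markup, so unclosed tags are tolerated.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download a page and decode it (UTF-8 assumed if no charset is declared)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class FieldExtractor(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._current = None  # field currently being captured, if any

    def handle_starttag(self, tag, attrs):
        for name in (dict(attrs).get("class") or "").split():
            if name in self.wanted:
                self._current = name
                self.fields.setdefault(name, "")

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current capture.
        self._current = None


def extract_fields(html, wanted):
    """Return {field: text} for the wanted class names found in the page."""
    parser = FieldExtractor(wanted)
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.fields.items()}


def rows_to_csv(rows, columns):
    """Render a list of field dictionaries as CSV text; missing fields stay empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A full run would call fetch on each company URL, pass the result to extract_fields, and write the output of rows_to_csv to a .csv file, which Excel can open directly. With thousands of pages, it is worth pausing between requests and checking that the site permits automated access.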

