Web Scraping, data mining, data extraction


Problem description

I am tasked with creating web scraping software, and I don't know where to even begin. Any help would be appreciated. Even just telling me how this data is organized, or what "type" of data layout the website is using, would help, because I would be able to Google that term.

http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/Default/7330_FAC-delta_V2.4.1/7330_FAC-delta_V2.4.1-pq.dgm&node=Buildings.Angus_addition&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952

http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/network.dgm&node=Buildings.AERL&unique_id=75660a13-5145-42d5-b661-a50f328306c7&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952

Basically, I need to extract the "harmonic values" from this website. Specifically, I need the 9 numbers displayed on the second link. The numbers are not passed to the HTML; they just seem to update automatically every few seconds. I need to be able to extract these values in real time as they update. Even if that is not possible, I still need to show that doing such web scraping is impossible. I have not been given any APIs to the back end, and I do not know how their site receives the data.

Overall, ANY help would be appreciated, even if it's just some simple search terms to point me in the right direction. I am currently clueless when it comes to web scraping and data mining.

Solution

Web Scraping

Parsing HTML from a website is also called screen scraping. It is the process of accessing information on an external website (the information must be public data) and processing it as required. For instance, if we want the average rating of the Nokia Lumia 1020 across different websites, we can scrape the ratings from each of those websites and calculate the average in our own code. In general, whatever you can see as public data as an ordinary user, you will be able to scrape easily using HTML Agility Pack.
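As a rough illustration of the ratings example, here is a minimal sketch in Python using only the standard library (Python is used here purely for brevity; it is not one of the tools listed in this answer). The page snippets and the `rating` class name are made up for illustration. Real sites use their own markup, and in practice the HTML would come from an HTTP request rather than from hard-coded strings.

```python
from html.parser import HTMLParser
from statistics import mean


class RatingParser(HTMLParser):
    """Collects numbers found inside elements that carry the class 'rating'."""

    def __init__(self):
        super().__init__()
        self._in_rating = False
        self.ratings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if "rating" in classes:
            self._in_rating = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the rating element.
        self._in_rating = False

    def handle_data(self, data):
        if self._in_rating:
            try:
                self.ratings.append(float(data.strip()))
            except ValueError:
                pass  # text inside the element was not a number


def extract_ratings(html):
    """Return all ratings found in one page of HTML."""
    parser = RatingParser()
    parser.feed(html)
    parser.close()
    return parser.ratings


# Hypothetical snippets standing in for HTML downloaded from different sites.
PAGES = [
    '<div><span class="rating">4.5</span> out of 5</div>',
    '<p>Score: <b class="rating score">4.0</b></p>',
    '<ul><li class="rating">3.5</li><li>no rating here</li></ul>',
]

all_ratings = [r for page in PAGES for r in extract_ratings(page)]
average = mean(all_ratings)
print(all_ratings, average)
```

This only works when the values are actually present in the downloaded HTML. It would not help by itself with numbers that the page fills in after loading.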

Try these:

ASP.NET: HTMLAgilityPack (open source library)

Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET

PHP & cURL: Web Scraping with PHP & cURL

Node.js: Screen Scraping with Node.js

YQL & Ajax: Screen scraping using YQL and AJAX
