Scraping non-HTML websites with R?


Question

Scraping data from HTML tables on HTML websites is cool and easy. But how can I do this if the website is not written in HTML and needs a browser to show the relevant information, e.g. if it is an ASP website, or if the data is not in the page source but comes in through Java code?

For example: http://www.bwea.com/ukwed/construction.asp.

With VBA for Excel one can write a function that opens an IE session, calls the website, and then basically copies and pastes the content of the website. Is there any chance of doing something similar with R?

Recommended answer

This is normal HTML, with the usual trouble of having to clean up the data after scraping it.

The following does the trick:

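  • Use readHTMLTable in package XML to read the page
  • The data is the fifth table on the page, so extract the fifth element
  • Take the first row and assign it to the names of the table
  • Delete the first row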
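Before running these steps against the live page, they can be tried on a small inline table. The sketch below is only an illustration: the HTML string is made up, the names html, doc, tables and demo are hypothetical, and header = FALSE is passed so that the first row arrives as data, as it does on the real page.

```r
library(XML)

# Made-up stand-in for a page: two tables, the second one holding the data,
# with its header stored as an ordinary first row of <td> cells.
html <- paste0(
  "<html><body>",
  "<table><tr><td>Home</td><td>About</td></tr>",
  "<tr><td>News</td><td>Contact</td></tr></table>",
  "<table>",
  "<tr><td>Date</td><td>Wind farm</td><td>Turbines</td></tr>",
  "<tr><td>August 2011</td><td>Mains of Hatton</td><td>3</td></tr>",
  "<tr><td>August 2011</td><td>Coultas Farm</td><td>1</td></tr>",
  "</table>",
  "</body></html>"
)

# Parse the string as HTML and read every <table> into a list of data frames.
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

# Pick the table of interest (the second one in this toy example).
demo <- tables[[2]]

# Promote the first row to column names, then drop it from the data.
names(demo) <- unname(unlist(demo[1, ]))
demo <- demo[-1, ]

str(demo)
```

The index of the table depends entirely on the page layout, so it is worth inspecting the list returned by readHTMLTable before choosing an element.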

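The code for the page itself:

library(XML)

x <- readHTMLTable("http://www.bwea.com/ukwed/construction.asp", 
                   as.data.frame=TRUE, stringsAsFactors=FALSE)
dat <- x[[5]]                            # the fifth table on the page
names(dat) <- unname(unlist(dat[1, ]))   # first row holds the column names
dat <- dat[-1, ]                         # drop that row from the data

The resulting data, as shown by str(dat):
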
'data.frame':   39 obs. of  10 variables:
 $ Date                : chr  "September 2011" "August 2011" "August 2011" "August 2011" ...
 $ Wind farm           : chr  "Baillie Wind farm - Bardnaheigh Farm" "Mains of Hatton" "Coultas Farm" "White Mill (Coldham ext)" ...
 $ Location            : chr  "Highland" "Aberdeenshire" "Nottinghamshire" "Cambridgeshire" ...
 $ Power(MW)           : chr  "2.5" "0.8" "0.33" "2" ...
 $ Turbines            : chr  "21" "3" "1" "7" ...
 $ MW Capacity         : chr  "52.5" "2.4" "0.33" "14" ...
 $ Annual homes equiv*.: chr  "29355" "1342" "185" "7828" ...
 $ Developer           : chr  "Baillie" "Eco2" "" "COOP" ...
 $ Latitude            : chr  "58 02 52N" "57 28 11N" "53 04 33N" "52 35 47N" ...
 $ Longitude           : chr  "04 07 40W" "02 30 32W" "01 18 16W" "00 07 41E" ...
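Every column comes back as character, which is the clean-up trouble mentioned above. The following is a minimal sketch of one way to handle it, not part of the original answer: it rebuilds a few rows by hand from the output shown, uses simplified column names, and introduces a hypothetical helper dms_to_decimal that assumes coordinates always look like "DD MM SSH" with H being one of N, S, E or W.

```r
# Small stand-in for the scraped data frame, using values from the output above.
# Column names are simplified for the example.
dat <- data.frame(
  Turbines  = c("21", "3", "1", "7"),
  Capacity  = c("52.5", "2.4", "0.33", "14"),
  Latitude  = c("58 02 52N", "57 28 11N", "53 04 33N", "52 35 47N"),
  Longitude = c("04 07 40W", "02 30 32W", "01 18 16W", "00 07 41E"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert "DD MM SSH" strings to signed decimal degrees.
# South and west come out negative.
dms_to_decimal <- function(x) {
  # Strip the trailing hemisphere letter, then split into degrees/minutes/seconds.
  parts <- strsplit(substr(x, 1, nchar(x) - 1), " ", fixed = TRUE)
  deg <- vapply(parts,
                function(p) sum(as.numeric(p) / c(1, 60, 3600)),
                numeric(1))
  hemi <- substr(x, nchar(x), nchar(x))
  ifelse(hemi %in% c("S", "W"), -deg, deg)
}

# Convert the numeric-looking columns and the coordinates.
dat$Turbines  <- as.integer(dat$Turbines)
dat$Capacity  <- as.numeric(dat$Capacity)
dat$Latitude  <- dms_to_decimal(dat$Latitude)
dat$Longitude <- dms_to_decimal(dat$Longitude)

str(dat)
```

On the real table, blank cells such as the empty Developer entry would need a decision of their own. as.numeric turns an empty string into NA.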
