Creating a table by web-scraping using a loop


Problem Description



I'm attempting to scrape tax-rates.org to get the average tax percentage for each county in Texas. I have a list of 255 counties in a csv file which I import as "TX_counties"; it's a single-column table. I have to build the URL for each county as a string, so I set d1 to the first cell using [i,1], concatenate it into a URL string, perform the scrape, and then add +1 to [i] so it moves on to the next cell and the next county name, and the process continues.

The problem is I can't figure out how to store the scrape results in a "growing list" that I can then turn into a table and save to a .csv file at the end. I'm only able to scrape one county at a time, and the result gets overwritten on each pass.

Any thoughts? (fairly new to R and scraping in general)

library(XML)         # htmlTreeParse(), getNodeSet(), xmlValue()
library(data.table)  # data.table()

i <- 1
for (i in 1:255) {

  d1 <- as.character(TX_counties[i,1])

  uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')

  html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)

  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)

  t1 <- data.table(d1,avg_taxrate)  # overwritten on every pass, so only the last county survives

  i <- i+1

}

write.csv(t1,"2015_TX_PropertyTaxes.csv")

Solution

This uses rvest, provides a progress bar and takes advantage of the fact that the URLs are already there for you on the page:

library(rvest)
library(pbapply)

pg <- read_html("http://www.tax-rates.org/texas/property-tax")

# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")

# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))

# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)

tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)

tax_table
##   county_name              county_rate
## 1    anderson Avg. 1.24% of home value
## 2     andrews Avg. 0.88% of home value
## 3    angelina Avg. 1.35% of home value
## 4     aransas Avg. 1.29% of home value

write.csv(tax_table, "2015_TX_PropertyTaxes.csv")

NOTE 1: I limited scraping to 4 to not kill the bandwidth of a site that offers free data.
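
The code block as shown walks every county link it finds; one way to reproduce the limited test run the note describes is to subset the node set right after the html_nodes() call (a minimal sketch; the cut-off of 4 is just the value mentioned in the note):

# keep only the first few county links for a polite test run,
# so county_name and county_rate stay the same length
ctys <- ctys[1:4]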

NOTE 2: There are only 254 county tax links available on that site, so you seem to have an extra one if you have 255.
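
(If you want to check that count against your csv, the objects already in scope make it a one-liner, assuming TX_counties from the question is loaded:)

length(ctys)          # number of county links found on the index page
nrow(TX_counties)     # number of counties in the imported csv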
