Creating a table by web-scraping using a loop


Problem Description



I'm attempting to scrape tax-rates.org to get the average tax percentage for each county in Texas. I have a list of 255 counties in a csv file which I import as "TX_counties"; it's a single-column table. I have to build the URL for each county as a string, so I set d1 to the first cell using [i,1], concatenate it into a URL string, perform the scrape, and then add +1 to [i] so it moves on to the next cell and the next county name, and the process continues.

The problem is I can't figure out how to store the scrape results in a "growing list" that I can then turn into a table and save to a .csv file at the end. I'm only able to scrape one county at a time, and the result gets overwritten on each pass.

Any thoughts? (fairly new to R and scraping in general)

library(XML)         # htmlTreeParse(), getNodeSet(), xmlValue()
library(data.table)  # data.table()

i <- 1
for (i in 1:255) {

  d1 <- as.character(TX_counties[i,1])

  uri.seed <- paste(c('http://www.tax-rates.org/texas/',d1,'_county_property_tax'), collapse='')

  html <- htmlTreeParse(file = uri.seed, isURL=TRUE, useInternalNodes = TRUE)

  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)

  t1 <- data.table(d1,avg_taxrate)  # overwritten on every pass, so only the last county survives

  i <- i+1

}

write.csv(t1,"2015_TX_PropertyTaxes.csv")

Solution

This uses rvest, provides a progress bar and takes advantage of the fact that the URLs are already there for you on the page:

library(rvest)
library(pbapply)

pg <- read_html("http://www.tax-rates.org/texas/property-tax")

# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")

# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))

# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)

tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)

tax_table
##   county_name              county_rate
## 1    anderson Avg. 1.24% of home value
## 2     andrews Avg. 0.88% of home value
## 3    angelina Avg. 1.35% of home value
## 4     aransas Avg. 1.29% of home value

write.csv(tax_table, "2015_TX_PropertyTaxes.csv")

NOTE 1: I limited scraping to 4 to not kill the bandwidth of a site that offers free data.
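
The code block as shown walks every county link it finds; one way to reproduce the limited test run the note describes is to subset the node set right after the html_nodes() call (a minimal sketch; the cut-off of 4 is just the value mentioned in the note):

# keep only the first few county links for a polite test run,
# so county_name and county_rate stay the same length
ctys <- ctys[1:4]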

NOTE 2: There are only 254 county tax links available on that site, so you seem to have an extra one if you have 255.
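
(If you want to check that count against your csv, the objects already in scope make it a one-liner, assuming TX_counties from the question is loaded:)

length(ctys)          # number of county links found on the index page
nrow(TX_counties)     # number of counties in the imported csv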
