Creating a table by web-scraping using a loop
Problem description
I'm attempting to web-scrape tax-rates.org to get the average tax percentage for each county in Texas. I have a list of 255 counties in a CSV file which I import as "TX_counties"; it's a single-column table. I have to build the URL for each county as a string, so I set d1 to the first cell using [i,1], concatenate it into a URL string, perform the scrape, then add +1 to [i], which moves on to the second cell for the next county name, and the process continues.
The problem is I can't figure out how to store the scrape results in a "growing list" which I then want to turn into a table and save to a .csv file at the end. I'm only able to scrape one county at a time, and then it overwrites itself.
Any thoughts? (fairly new to R and scraping in general)
library(XML)         # htmlTreeParse, getNodeSet, xmlValue
library(data.table)

for (i in 1:255) {
  d1 <- as.character(TX_counties[i, 1])
  uri.seed <- paste(c('http://www.tax-rates.org/texas/', d1, '_county_property_tax'), collapse='')
  html <- htmlTreeParse(file = uri.seed, isURL = TRUE, useInternalNodes = TRUE)
  avg_taxrate <- sapply(getNodeSet(html, "//div[@class='box']/div/div[1]/i[1]"), xmlValue)
  t1 <- data.table(d1, avg_taxrate)  # overwritten on every iteration
}

write.csv(t1, "2015_TX_PropertyTaxes.csv")
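The accumulation pattern the question is asking about can be sketched without any scraping: collect each iteration's result in a pre-allocated list and bind everything once at the end, instead of overwriting t1 on every pass. The county names and rates below are hypothetical stand-ins for the scraped values.

```r
library(data.table)

# Pre-allocate one slot per iteration (255 in the real loop; 3 here)
results <- vector("list", 3)

for (i in 1:3) {
  # In the real loop, these values would come from the scraped page
  results[[i]] <- data.table(county      = paste0("county_", i),
                             avg_taxrate = sprintf("%.2f%%", i * 0.5))
}

# A single bind at the end replaces the overwritten t1
t1 <- rbindlist(results)
```

rbindlist() is much faster than calling rbind() inside the loop, because it avoids re-copying the accumulated table on every iteration.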
This uses rvest, provides a progress bar, and takes advantage of the fact that the URLs are already there for you on the page:
library(rvest)
library(pbapply)
pg <- read_html("http://www.tax-rates.org/texas/property-tax")
# get all the county tax table links
ctys <- html_nodes(pg, "table.propertyTaxTable > tr > td > a[href*='county_property']")
# match your lowercased names
county_name <- tolower(gsub(" County", "", html_text(ctys)))
# spider each page and return the rate %
county_rate <- pbsapply(html_attr(ctys, "href"), function(URL) {
  cty_pg <- read_html(URL)
  html_text(html_nodes(cty_pg, xpath="//div[@class='box']/div/div[1]/i[1]"))
}, USE.NAMES=FALSE)
tax_table <- data.frame(county_name, county_rate, stringsAsFactors=FALSE)
tax_table
## county_name county_rate
## 1 anderson Avg. 1.24% of home value
## 2 andrews Avg. 0.88% of home value
## 3 angelina Avg. 1.35% of home value
## 4 aransas Avg. 1.29% of home value
write.csv(tax_table, "2015_TX_PropertyTaxes.csv")
NOTE 1: I limited scraping to 4 to not kill the bandwidth of a site that offers free data.
NOTE 2: There are only 254 county tax links available on that site, so you seem to have an extra one if you have 255.
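If a numeric rate is needed later, the scraped strings (which look like "Avg. 1.24% of home value" per the output above) can be reduced to a number with a regex capture. This is a base-R sketch; the county_rate values are sample strings in the answer's format, not live scraped data.

```r
# Sample strings in the format the scrape returns
county_rate <- c("Avg. 1.24% of home value", "Avg. 0.88% of home value")

# Capture the digits (and dot) immediately before the "%" sign
rate_pct <- as.numeric(sub(".*?([0-9.]+)%.*", "\\1", county_rate, perl = TRUE))
rate_pct
## [1] 1.24 0.88
```

Storing the numeric column alongside the original string keeps the table sortable while preserving the source text.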