With rvest, how to extract HTML contents from the object returned by submit_form()


Question

I am trying to download some traffic data from pems.dot.ca.gov, following this topic.

rm(list=ls())
library(rvest)
library(xml2)
library(httr)
url <- "http://pems.dot.ca.gov/?report_form=1&dnode=tmgs&content=tmg_volumes&tab=tmg_vol_ts&export=&tmg_station_id=74250&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"
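# Open a session, then take the first form on the page and fill in the credentials
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform,
                          'username' = 'omitted',
                          'password' = 'omitted')
# Submit the form and inspect what comes back
resp <- submit_form(pgsession, filled_form)
resp_2 <- resp$response
cont <- resp_2$content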

I checked the class() of each object: resp is a 'session', resp_2 is a 'response', and cont is 'raw'. My question is: how can I extract the HTML content correctly so that I can use XPath to pick out the actual data I want from this page? My intuition is that I should parse resp_2, which is a response, but I cannot make it work. Your help is highly appreciated!

Solution

This should do it:

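# Parse the body of the httr response into an HTML document
pg <- content(resp$response)

# Select the data table by its CSS class and convert it to a list of data frames
tab <- html_nodes(pg, "table.inlayTable") %>%
  html_table()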

head(tab[[1]])
##                 X1      X2           X3           X4
## 1                          Data Quality Data Quality
## 2             Hour 8 Lanes   % Observed  % Estimated
## 3 05/24/2013 00:00   1,311           50            0
## 4 05/24/2013 01:00     729           50            0
## 5 05/24/2013 02:00     399           50            0
## 6 05/24/2013 03:00     487           50            0
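
Since the question asks about XPath and about the raw content specifically, here is a minimal offline sketch of that route. It assumes the raw vector holds HTML bytes, as resp$response$content would. The markup below is invented for illustration; only the class name inlayTable comes from the answer above, and the real page may be structured differently.

```r
library(xml2)
library(rvest)

# Stand-in for resp$response$content: the raw bytes of a small HTML page.
# This markup is made up for the example, not taken from the real site.
demo_html <- paste0(
  "<html><body>",
  "<table class='inlayTable'>",
  "<tr><td>Hour</td><td>8 Lanes</td></tr>",
  "<tr><td>05/24/2013 00:00</td><td>1,311</td></tr>",
  "<tr><td>05/24/2013 01:00</td><td>729</td></tr>",
  "</table>",
  "</body></html>"
)
cont <- charToRaw(demo_html)

# read_html() accepts a raw vector directly, so no manual decoding is needed
pg <- read_html(cont)

# XPath equivalent of the CSS selector "table.inlayTable"
tables <- xml_find_all(pg, "//table[contains(@class, 'inlayTable')]")

# Convert the first matching table node into a data frame
tab <- html_table(tables[[1]])
print(tab)
```

The XPath contains() test is a loose match on the class attribute; if the page had other classes containing the same substring, a stricter expression would be needed.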

(you'll obviously need to modify the column names)
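
One possible way to tidy the result, sketched with base R only. The data frame below is a hand-typed stand-in with the same shape as the output above, and the new column names and the time zone are illustrative assumptions, not part of the original answer.

```r
# Hypothetical stand-in for tab[[1]]: two header rows followed by data rows
raw_tab <- data.frame(
  X1 = c("", "Hour", "05/24/2013 00:00", "05/24/2013 01:00"),
  X2 = c("", "8 Lanes", "1,311", "729"),
  X3 = c("Data Quality", "% Observed", "50", "50"),
  X4 = c("Data Quality", "% Estimated", "0", "0"),
  stringsAsFactors = FALSE
)

# Drop the two header rows and assign descriptive names (names are made up here)
clean <- raw_tab[-(1:2), ]
names(clean) <- c("hour", "flow_8_lanes", "pct_observed", "pct_estimated")
rownames(clean) <- NULL

# Convert text to proper types; "UTC" is a placeholder time zone assumption
clean$hour <- as.POSIXct(clean$hour, format = "%m/%d/%Y %H:%M", tz = "UTC")
clean$flow_8_lanes <- as.numeric(gsub(",", "", clean$flow_8_lanes))
clean$pct_observed <- as.numeric(clean$pct_observed)
clean$pct_estimated <- as.numeric(clean$pct_estimated)

str(clean)
```

Stripping the thousands separator before as.numeric() matters here, because "1,311" would otherwise convert to NA.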
