With rvest, how to extract html contents from the object returned by submit_form()

Views: 422 Published: 2018/6/21 14:24:44 Tags: html r web-scraping html-parsing rvest


Problem description

I am trying to download some traffic data from pems.dot.ca.gov, following this topic.

    rm(list = ls())
    library(rvest)
    library(xml2)
    library(httr)

    url <- "http://pems.dot.ca.gov/?report_form=1&dnode=tmgs&content=tmg_volumes&tab=tmg_vol_ts&export=&tmg_station_id=74250&s_time_id=1369094400&s_time_id_f=05%2F21%2F2013&e_time_id=1371772740&e_time_id_f=06%2F20%2F2013&tod=all&tod_from=0&tod_to=0&dow_5=on&dow_6=on&tmg_sub_id=all&q=obs_flow&gn=hour&html.x=34&html.y=8"
    pgsession <- html_session(url)
    pgform <- html_form(pgsession)[[1]]
    filled_form <- set_values(pgform,
                              'username' = 'omitted',
                              'password' = 'omitted')
    resp = submit_form(pgsession, filled_form)
    resp_2 = resp$response
    cont = resp_2$content

I checked the class() of these items and found that resp is a 'session', resp_2 is a 'response', and cont is 'raw'. My question is: how can I extract the html content correctly so that I can proceed with XPath to pick out the actual data I want from this page? My intuition is that I should parse resp_2, which is a response, but I just cannot make it work. Your help is highly appreciated!

Solution

This should do it:

    pg <- content(resp$response)

    html_nodes(pg, "table.inlayTable") %>%
      html_table() -> tab

    head(tab[[1]])
    ##                 X1      X2           X3           X4
    ## 1                          Data Quality Data Quality
    ## 2             Hour 8 Lanes   % Observed  % Estimated
    ## 3 05/24/2013 00:00   1,311           50            0
    ## 4 05/24/2013 01:00     729           50            0
    ## 5 05/24/2013 02:00     399           50            0
    ## 6 05/24/2013 03:00     487           50            0

(you'll obviously need to modify the column names)
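
For readers who want to see the remaining steps without logging in to PeMS, here is a minimal offline sketch. It assumes the raw bytes in resp$response$content are ordinary HTML; the sample page, the variable names, and the column names (hour, flow_8_lanes, pct_observed, pct_estimated) are made up for illustration and are not real PeMS output. It parses a raw vector with xml2::read_html(), which is roughly what httr's content() does for a text/html response. It then selects the table with XPath, as the question asks, and tidies the two header rows by hand.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for resp$response$content: the raw bytes of a small
# HTML page shaped like the table in the answer (not real PeMS output).
sample_html <- paste0(
  '<html><body><table class="inlayTable">',
  '<tr><td></td><td></td><td>Data Quality</td><td>Data Quality</td></tr>',
  '<tr><td>Hour</td><td>8 Lanes</td><td>% Observed</td><td>% Estimated</td></tr>',
  '<tr><td>05/24/2013 00:00</td><td>1,311</td><td>50</td><td>0</td></tr>',
  '<tr><td>05/24/2013 01:00</td><td>729</td><td>50</td><td>0</td></tr>',
  '</table></body></html>'
)
cont <- charToRaw(sample_html)

# A raw vector of HTML bytes can be parsed directly by xml2::read_html().
pg <- read_html(cont)

# Select the table with XPath rather than a CSS selector.
tbl_node <- html_node(pg, xpath = "//table[@class='inlayTable']")

# header = FALSE keeps both header rows as data so they can be handled by hand.
tab <- html_table(tbl_node, header = FALSE)

# Drop the two header rows and assign descriptive (hypothetical) column names.
flow <- as.data.frame(tab[-(1:2), ])
names(flow) <- c("hour", "flow_8_lanes", "pct_observed", "pct_estimated")

# Convert types; strip the thousands separator before converting the counts.
flow$hour <- as.POSIXct(flow$hour, format = "%m/%d/%Y %H:%M", tz = "UTC")
flow$flow_8_lanes <- as.numeric(gsub(",", "", flow$flow_8_lanes))
flow$pct_observed <- as.numeric(flow$pct_observed)
flow$pct_estimated <- as.numeric(flow$pct_estimated)
```

With a live session, the same steps should apply with pg <- content(resp$response) in place of the read_html() call. Note that in rvest 1.0.0 and later, html_session(), set_values() and submit_form() were renamed to session(), html_form_set() and session_submit().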
