Scraping HTML data.table using rvest


Problem description


I'm trying to scrape the "Fish Sampled" table data from the Minnesota DNR site using the R rvest package. I used the Chrome extension SelectorGadget to find the XPath for the table, but I'm unable to get any table data from the webpage into R. Any help is appreciated.

library(rvest)

urllakes <- read_html("http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700")

lakesnodes <- html_nodes(urllakes,xpath = '//*[(@id = "lake-survey")]')

html_table(lakesnodes,fill=TRUE) #Error: html_name(x) == "table" is not TRUE
html_text(lakesnodes) # [1] "" but no data is returned 

Solution

Start a new tab. Open Developer Tools. Then, go to http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700.

Go to the Network tab. Look for the request to http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi with type=lake_survey.

That's your target. With the following, you can pass in a MN DNR URL or just the id at the end of the URL and get data back.

library(httr)
library(jsonlite)

read_lake_survey <- function(orig_url_or_id) {

  orig_url_or_id <- orig_url_or_id[1]

  if (grepl("^htt", orig_url_or_id)) {
    tmp <- httr::parse_url(orig_url_or_id)
    if (!is.null(tmp$query$downum)) {
      orig_url_or_id <- tmp$query$downum
    } else {
      stop("Invalid URL specified", call.=FALSE)
    }
  }

  httr::GET(
    url = "http://maps2.dnr.state.mn.us/cgi-bin/lakefinder/detail.cgi",
    query = list(
      type = "lake_survey",
      callback = "",
      id = orig_url_or_id,
      `_` = as.numeric(Sys.time())
    )
  ) -> res

  httr::stop_for_status(res)

  out <- httr::content(res, as="text", encoding="UTF-8")
  out <- jsonlite::fromJSON(out, flatten=TRUE)
  out

}

Like so:

orig_url <- "http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700"

str(read_lake_survey(orig_url), 2)
## List of 4
##  $ timestamp: int 1506900750
##  $ status   : chr "SUCCESS"
##  $ result   :List of 13
##   ..$ averageWaterClarity: chr "7.0"
##   ..$ sampledPlants      : list()
##   ..$ officeCode         : chr "F314"
##   ..$ littoralAcres      : int 76
##   ..$ shoreLengthMiles   : num 2.45
##   ..$ areaAcres          : num 152
##   ..$ surveys            :'data.frame':  6 obs. of  52 variables:
##   ..$ accesses           :'data.frame':  1 obs. of  5 variables:
##   ..$ lakeName           : chr "Weaver"
##   ..$ DOWNumber          : chr "27011700"
##   ..$ waterClarity       : chr [1, 1:2] "07/14/2008" "7"
##   ..$ meanDepthFeet      : num 20.7
##   ..$ maxDepthFeet       : int 57
##  $ message  : chr "Normal execution."

str(read_lake_survey("27011700"), 2)
## List of 4
##  $ timestamp: int 1506900750
##  $ status   : chr "SUCCESS"
##  $ result   :List of 13
##   ..$ averageWaterClarity: chr "7.0"
##   ..$ sampledPlants      : list()
##   ..$ officeCode         : chr "F314"
##   ..$ littoralAcres      : int 76
##   ..$ shoreLengthMiles   : num 2.45
##   ..$ areaAcres          : num 152
##   ..$ surveys            :'data.frame':  6 obs. of  52 variables:
##   ..$ accesses           :'data.frame':  1 obs. of  5 variables:
##   ..$ lakeName           : chr "Weaver"
##   ..$ DOWNumber          : chr "27011700"
##   ..$ waterClarity       : chr [1, 1:2] "07/14/2008" "7"
##   ..$ meanDepthFeet      : num 20.7
##   ..$ maxDepthFeet       : int 57
##  $ message  : chr "Normal execution."

str(read_lake_survey("http://example.com"))
##  Error: Invalid URL specified 
##    3. stop("Invalid URL specified", call. = FALSE) 
##    2. read_lake_survey("http://example.com") 
##    1. str(read_lake_survey("http://example.com")) 
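
The URL-handling branch of read_lake_survey() can be tried on its own, without contacting the server. The sketch below uses a hypothetical helper, extract_downum(), which is not part of the function above. It relies only on httr::parse_url(), which the function already uses, and it assumes the lake id is always carried in the downum query parameter.

```r
library(httr)

# Hypothetical helper (not in the original answer): return the lake id
# from either a full MN DNR report URL or a bare id string.
extract_downum <- function(orig_url_or_id) {
  orig_url_or_id <- orig_url_or_id[1]
  if (!grepl("^htt", orig_url_or_id)) {
    return(orig_url_or_id)  # already a bare id
  }
  downum <- httr::parse_url(orig_url_or_id)$query$downum
  if (is.null(downum)) {
    stop("Invalid URL specified", call. = FALSE)
  }
  downum
}

from_url <- extract_downum("http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27011700")
from_id  <- extract_downum("27011700")

# A URL without `downum` raises the error; tryCatch() converts it to NA here
bad <- tryCatch(extract_downum("http://example.com"),
                error = function(e) NA_character_)
```

Wrapping the call in tryCatch() is one option when looping over many lake ids, so that a single malformed URL does not abort the whole run.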

You can poke at it to prove it's all there.

library(tidyverse)

# get the data into a variable
dat <- read_lake_survey(orig_url)

# focus on the surveys
surveys <- dat$result$surveys

There are "n" data frames for the surveys that match the popup on the page.

There are also many other list elements with "n" entries that are associated with the surveys in the same popup. I don't do this type of analysis, so I don't know which of them make sense to combine with the data frames.
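
One way to see which pieces of surveys are nested is to ask which of its columns are lists. The sketch below runs on a small made-up data frame rather than the live API result: the column names echo the real ones, but the values are invented for illustration.

```r
# Toy stand-in for `surveys`: one ordinary column and one list-column
# holding a data frame per survey (all values are invented).
toy_surveys <- data.frame(surveyID = c("a1", "b2"), stringsAsFactors = FALSE)
toy_surveys$fishCatchSummaries <- list(
  data.frame(species = c("YEB", "NOP"), totalCatch = c(50L, 35L),
             stringsAsFactors = FALSE),
  data.frame(species = "BLG", totalCatch = 13L, stringsAsFactors = FALSE)
)

# Which columns are lists, i.e. nested per-survey elements?
is_nested <- vapply(toy_surveys, is.list, logical(1))
nested_cols <- names(toy_surveys)[is_nested]

# How many rows does each nested data frame have?
rows_per_survey <- vapply(toy_surveys$fishCatchSummaries, nrow, integer(1))
```

On the real object the same two vapply() calls should give a quick inventory of what is available before deciding what to join together.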

This is likely enough to get you going a bit further. It's just adding other elements to the surveys.

map2(surveys$fishCatchSummaries, surveys$surveyDate, ~{ .x$survey_date <- .y ; .x }) %>% 
  map2(surveys$surveyType, ~{ .x$survey_type <- .y ; .x }) %>% 
  map2(surveys$surveySubType, ~{ .x$survey_subtype <- .y ; .x }) %>% 
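  # assign to .x, not `.`: inside a purrr lambda `.` is a separate copy, so `.$survey_id <- .y` is discarded
  # with .x the result also carries a survey_id column (absent from the 12 columns shown below)
  map2_df(surveys$surveyID, ~{ .x$survey_id <- .y ; .x }) %>%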
  as_tibble() %>% 
  type_convert() %>% 
  glimpse()
## Observations: 120
## Variables: 12
## $ quartileCount  <chr> "0.5-7.5", "0.7-4.2", "N/A", "0.4-2.2", "0.9-5.7", "1.5-7.3"...
## $ CPUE           <dbl> 25.0, 3.6, 4.0, 0.5, 5.0, 17.5, 6.5, 1.0, 0.8, 0.2, 190.0, 0...
## $ totalCatch     <int> 50, 18, 20, 1, 25, 35, 13, 2, 4, 1, 950, 1, 2, 4, 3, 13, 27,...
## $ species        <chr> "YEB", "PMK", "HSF", "WTS", "YEB", "NOP", "BLG", "BLC", "BLC...
## $ totalWeight    <dbl> 41.75, 2.30, 4.50, 3.50, 24.25, 146.25, 3.25, 0.60, 1.45, 2....
## $ quartileWeight <chr> "0.5-0.8", "0.1-0.2", "N/A", "1.5-2.4", "0.5-0.8", "2.0-3.5"...
## $ averageWeight  <dbl> 0.83, 0.13, 0.23, 3.50, 0.97, 4.18, 0.25, 0.30, 0.36, 2.50, ...
## $ gearCount      <int> 2, 5, 5, 2, 5, 2, 2, 2, 5, 5, 5, 2, 2, 2, 5, 2, 5, 5, 5, 2, ...
## $ gear           <chr> "Standard gill nets", "Standard trap nets", "Standard trap n...
## $ survey_date    <date> 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23,...
## $ survey_type    <chr> "Standard Survey", "Standard Survey", "Standard Survey", "St...
## $ survey_subtype <chr> "Population Assessment", "Population Assessment", "Populatio...
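
The pattern in that pipeline, tagging each nested table with a value from its parent survey and then stacking the tables, can be seen in isolation on toy data. The sketch below uses invented values and writes the helper as an ordinary function(tbl, d) instead of the ~ shorthand, which is purely a stylistic choice.

```r
library(purrr)
library(dplyr)

# Invented data: two per-survey catch tables and their survey dates
catches <- list(
  data.frame(species = c("YEB", "NOP"), totalCatch = c(50L, 35L),
             stringsAsFactors = FALSE),
  data.frame(species = "BLG", totalCatch = 13L, stringsAsFactors = FALSE)
)
survey_dates <- c("1980-06-23", "1985-07-01")

# Tag each table with its survey's date ...
tagged <- map2(catches, survey_dates, function(tbl, d) {
  tbl$survey_date <- d
  tbl
})

# ... then stack the tagged tables into one data frame
combined <- bind_rows(tagged)
combined
```

Each further map2() call in the real pipeline repeats the tagging step with a different survey-level column.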

If you're not familiar with piping, it's just a way to avoid temporary variables.

tmp <- map2(surveys$fishCatchSummaries, surveys$surveyDate, ~{ .x$survey_date <- .y ; .x })
tmp <- map2(tmp, surveys$surveyType, ~{ .x$survey_type <- .y ; .x })
tmp <- map2(tmp, surveys$surveySubType, ~{ .x$survey_subtype <- .y ; .x })
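# assign to .x, not `.`: `.` is a separate copy inside the lambda, so `.$survey_id <- .y` is discarded
# with .x the result also carries a survey_id column (absent from the 12 columns shown below)
tmp <- map2_df(tmp, surveys$surveyID, ~{ .x$survey_id <- .y ; .x })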
tmp <- as_tibble(tmp)
final_data <- type_convert(tmp)

glimpse(final_data)
## Observations: 120
## Variables: 12
## $ quartileCount  <chr> "0.5-7.5", "0.7-4.2", "N/A", "0.4-2.2", "0.9-5.7", "1.5-7.3"...
## $ CPUE           <dbl> 25.0, 3.6, 4.0, 0.5, 5.0, 17.5, 6.5, 1.0, 0.8, 0.2, 190.0, 0...
## $ totalCatch     <int> 50, 18, 20, 1, 25, 35, 13, 2, 4, 1, 950, 1, 2, 4, 3, 13, 27,...
## $ species        <chr> "YEB", "PMK", "HSF", "WTS", "YEB", "NOP", "BLG", "BLC", "BLC...
## $ totalWeight    <dbl> 41.75, 2.30, 4.50, 3.50, 24.25, 146.25, 3.25, 0.60, 1.45, 2....
## $ quartileWeight <chr> "0.5-0.8", "0.1-0.2", "N/A", "1.5-2.4", "0.5-0.8", "2.0-3.5"...
## $ averageWeight  <dbl> 0.83, 0.13, 0.23, 3.50, 0.97, 4.18, 0.25, 0.30, 0.36, 2.50, ...
## $ gearCount      <int> 2, 5, 5, 2, 5, 2, 2, 2, 5, 5, 5, 2, 2, 2, 5, 2, 5, 5, 5, 2, ...
## $ gear           <chr> "Standard gill nets", "Standard trap nets", "Standard trap n...
## $ survey_date    <date> 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23, 1980-06-23,...
## $ survey_type    <chr> "Standard Survey", "Standard Survey", "Standard Survey", "St...
## $ survey_subtype <chr> "Population Assessment", "Population Assessment", "Populatio...

final_data
## # A tibble: 120 x 12
##    quartileCount  CPUE totalCatch species totalWeight quartileWeight averageWeight gearCount               gear survey_date     survey_type        survey_subtype
##            <chr> <dbl>      <int>   <chr>       <dbl>          <chr>         <dbl>     <int>              <chr>      <date>           <chr>                 <chr>
##  1       0.5-7.5  25.0         50     YEB       41.75        0.5-0.8          0.83         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  2       0.7-4.2   3.6         18     PMK        2.30        0.1-0.2          0.13         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  3           N/A   4.0         20     HSF        4.50            N/A          0.23         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  4       0.4-2.2   0.5          1     WTS        3.50        1.5-2.4          3.50         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  5       0.9-5.7   5.0         25     YEB       24.25        0.5-0.8          0.97         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
##  6       1.5-7.3  17.5         35     NOP      146.25        2.0-3.5          4.18         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  7           N/A   6.5         13     BLG        3.25            N/A          0.25         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  8      2.5-16.5   1.0          2     BLC        0.60        0.1-0.3          0.30         2 Standard gill nets  1980-06-23 Standard Survey Population Assessment
##  9      1.8-21.2   0.8          4     BLC        1.45        0.2-0.3          0.36         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
## 10           N/A   0.2          1     NOP        2.50            N/A          2.50         5 Standard trap nets  1980-06-23 Standard Survey Population Assessment
## # ... with 110 more rows
