Save response from web-scraping as csv file


Problem description

I downloaded a file from a website with rvest. How can I save the response as a csv file?

Step 1: Monkey patch the rvest package, as described in this thread: How to submit login form in Rvest package w/o button argument

library(tidyverse)
library(rvest)
library(R.utils)

# monkey patch submit_form: a custom submit_request that treats
# <input> elements of type "submit", "image", or "button" as valid
# submission targets
custom.submit_request <- function(form, submit = NULL) {
  # a field can submit the form if it has one of the submit-like types
  is_submit <- function(x) {
    if (!exists("type", x) || is.null(x$type)) {
      return(FALSE)
    }
    tolower(x$type) %in% c("submit", "image", "button")
  }
  submits <- Filter(is_submit, form$fields)
  if (length(submits) == 0) {
    stop("Could not find possible submission target.", call. = FALSE)
  }
  # default to the first submit button if none was named explicitly
  if (is.null(submit)) {
    submit <- names(submits)[[1]]
    message("Submitting with '", submit, "'")
  }
  if (!(submit %in% names(submits))) {
    stop("Unknown submission name '", submit, "'.\n", "Possible values: ",
         paste0(names(submits), collapse = ", "), call. = FALSE)
  }
  other_submits <- setdiff(names(submits), submit)
  method <- form$method
  if (!(method %in% c("POST", "GET"))) {
    warning("Invalid method (", method, "), defaulting to GET",
            call. = FALSE)
    method <- "GET"
  }
  url <- form$url
  # keep only fields that carry a value, and drop the submit buttons
  # that were not clicked
  fields <- form$fields
  fields <- Filter(function(x) length(x$value) > 0, fields)
  fields <- fields[setdiff(names(fields), other_submits)]
  # extract each field's value; an explicit lapply() avoids a clash
  # with purrr::pluck() from the tidyverse
  values <- lapply(fields, function(x) x$value)
  names(values) <- names(fields)
  list(method = method, encode = form$enctype, url = url, values = values)
}

reassignInPackage("submit_request", "rvest", custom.submit_request)
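
An optional sanity check (assuming reassignInPackage() replaced the binding without modifying the function body): compare the body of the function now stored in rvest's namespace with the custom version. It should print TRUE if the reassignment succeeded.

# optional sanity check: rvest's internal submit_request should now
# carry the body of the custom function
identical(body(rvest:::submit_request), body(custom.submit_request))
#> [1] TRUE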

Step 2: Download the file

# start scraping
url <- "https://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx"
session_1 <- html_session(url)
# there are two blue buttons:
session_1 %>%
  html_nodes(".BlueButton") %>%
  html_attr(name = "value")
#> [1] "Search" "Export"

# click export button
form <- html_form(session_1)[[1]]
session_2 <- submit_form(session = session_1, form = form, 
                         submit = "M$C$sCDTransactions$csfFilter$btnExport")

# now there are multiple buttons with hyperlinks
# get the link for the csv file
url_csv <- session_2 %>%
  html_nodes(".BlueButton") %>%
  html_attr(name = "href") %>%
  magrittr::extract2(4) %>%
  url_absolute(base = session_2$url)

# download csv file
file <- jump_to(session_2, url_csv)
file$response
#> Response [https://aws.state.ak.us/ApocReports/CampaignDisclosure/CDExpenditures.aspx?exportAll=False&exportFormat=CSV&isExport=True]
#>   Date: 2018-09-22 17:49
#>   Status: 200
#>   Content-Type: text/comma-separated-values; charset=utf-8
#>   Size: 6.34 kB
#> "Result","Date","Transaction Type","Payment Type","Payment Detail","Amou...
#> 1,5/8/2017,Expenditure,Future Campaign Account,,$200.00,US Postal Servic...
#> 2,11/29/2017,Expenditure,Bank Fee,,$12.00,Denali FCU,,440 E 36th Ave,Anc...
#> 3,1/1/2018,Expenditure,Electronic Funds Transfer,,$3.54,Google,,1600 Amp...
#> 4,12/31/2017,Expenditure,Electronic Funds Transfer,,$107.89,PayPal,,1840...
#> 5,1/31/2018,Expenditure,Electronic Funds Transfer,,$16.42,Paypal,,1840 E...
#> 6,2/1/2018,Expenditure,Check,197,$300.00,Corbett,Joshua,2448 Sprucewood ...
#> 7,2/1/2018,Expenditure,Electronic Funds Transfer,,$5.00,Google,,1600 Amp...
#> 8,2/28/2017,Expenditure,Bank Fee,,$4.10,First National Bank Alaska,,646 ...
#> 9,3/31/2017,Expenditure,Bank Fee,,$4.10,First National Bank Alaska,,646 ...
#> ...
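
Since the body is already CSV text, you can also parse the response straight into a data frame rather than going through a file on disk. A minimal sketch (the object name expenditures is just illustrative; httr and readr do the work):

# parse the CSV text in the response body into a tibble
expenditures <- file$response %>%
  httr::content(as = "text") %>%
  readr::read_csv()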

Created on 2018-09-22 by the reprex package (v0.2.1)

The response looks promising. How can I save that response directly as a csv file?

Recommended answer

httr::content(file$response, as="text") %>% write_lines("file.csv")
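
If you would rather write the body to disk byte for byte (sidestepping any text re-encoding), a small alternative sketch is to pull the raw content and hand it to writeBin():

# alternative: dump the raw response bytes straight to disk
httr::content(file$response, as = "raw") %>%
  writeBin("file.csv")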

I'm answering this so the question can be marked as solved. All credit goes to @hrbrmstr.
