Rvest XML web scraping


Problem description

I'm a beginner and I'm having a problem with scraping.

I need to get data about active/inactive VIES numbers for a few clients. For now, I'm trying with only one. On the website, I have to set the values and submit the form; the browser then redirects to the next page, where I can find the information I'm interested in.

My code is below. Maybe someone can help.

library(rvest)
library(XML)

url <- 'http://ec.europa.eu/taxation_customs/vies/vatResponse.html?locale=pl'
session1 <- html_session(url)
form1 <-html_form(session1)
form1

date <- set_values(form1[[1]], requesterMemberStateCode = "AT-Austria", requesterNumber = "4324")
date

set <- submit_form(session = session1,form = date)

Recommended answer

First of all, you don't need the XML package; rvest is enough.

You had the form-submission part almost right; you just used the wrong field names.

library(rvest)
#> Loading required package: xml2

url <- 'http://ec.europa.eu/taxation_customs/vies/vatResponse.html?locale=pl'
session1 <- html_session(url)
form1 <-html_form(session1)
form1[[1]]
#> <form> 'vowRequest' (POST vatResponse.html)
#>   <select> 'memberStateCode' [0/29]
#>   <input text> '': --
#>   <input text> 'number': 
#>   <input text> 'traderName': 
#>   <select> 'traderCompanyType' [0/0]
#>   <input text> 'traderStreet': 
#>   <input text> 'traderPostalCode': 
#>   <input text> 'traderCity': 
#>   <select> 'requesterMemberStateCode' [0/30]
#>   <input text> '': 
#>   <input text> 'requesterNumber': 
#>   <input hidden> 'action': check
#>   <input submit> 'check': Weryfikuj

date <- set_values(form1[[1]], memberStateCode = "AT", number = "4324")

set <- submit_form(session = session1,form = date)
#> Submitting with 'NULL'
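
To see why the field names matter, here is a minimal offline sketch of how one might check them before filling in a form. The HTML string is a hypothetical stand-in for the real page (it is not the actual VIES markup), and `wanted`, `missing_fields` and `option_values` are names made up for illustration. It also shows that a `<select>` expects the option's `value` ("AT"), not its visible label ("AT-Austria").

```r
library(rvest)

# Hypothetical stand-in for the real page: a cut-down form with the same
# field names that html_form() printed above.
html <- '<html><body>
<form name="vowRequest" method="post" action="vatResponse.html">
  <select name="memberStateCode">
    <option value="AT">AT-Austria</option>
    <option value="PL">PL-Poland</option>
  </select>
  <input type="text" name="number">
  <input type="submit" name="check" value="Verify">
</form>
</body></html>'

page <- read_html(html)
form <- html_form(page)[[1]]

# Names of the fields the form actually exposes
field_names <- names(form$fields)

# Fields we intend to set; anything listed in missing_fields would make
# the form-filling step fail or silently do nothing useful
wanted <- c("memberStateCode", "number")
missing_fields <- setdiff(wanted, field_names)

# Values accepted by the <select>: the "value" attributes, not the labels
option_nodes <- html_nodes(page, "select[name='memberStateCode'] option")
option_values <- html_attr(option_nodes, "value")
```

If `missing_fields` is empty and the value you want to send is in `option_values`, the names and values passed to the form-filling call should match what the server expects.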

After that, extracting the values you are interested in is easy:

set %>% 
  read_html() %>% 
  html_table(fill = TRUE) %>% 
  purrr::pluck(1) %>% 
  dplyr::slice(4:dplyr::n()) %>% 
  dplyr::select(1:2)
#> # A tibble: 6 x 2
#>   X1                      X2                 
#>   <chr>                   <chr>              
#> 1 Państwo Członkowskie    AT                 
#> 2 Numer VAT               AT 4324            
#> 3 Data zapytania          2018/05/17 14:33:10
#> 4 Nazwa                   ---                
#> 5 Adres                   ---                
#> 6 Identyfikator zapytania ""

Created on 2018-05-17 by the reprex package (v0.2.0).
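
A note for newer rvest versions: since rvest 1.0.0, `html_session()`, `set_values()` and `submit_form()` are deprecated in favour of `session()`, `html_form_set()` and `session_submit()`. The sketch below is not part of the original answer. `check_vat()` is a hypothetical helper that needs network access and assumes the page still exposes a form with these field names, which may no longer be true. The second half runs offline on a made-up stand-in form just to show what `html_form_set()` does.

```r
library(rvest)  # assumes rvest >= 1.0.0

# Hypothetical helper: the same flow as above with the newer function names.
# It is only defined here, not called, because it needs network access.
check_vat <- function(url, country, vat_number) {
  s <- session(url)
  form <- html_form(s)[[1]]
  filled <- html_form_set(form, memberStateCode = country, number = vat_number)
  session_submit(s, filled)
}

# Offline illustration on a stand-in form (not the real VIES markup)
page <- read_html('<html><body>
<form name="vowRequest" method="post" action="vatResponse.html">
  <select name="memberStateCode"><option value="AT">AT-Austria</option></select>
  <input type="text" name="number">
</form>
</body></html>')

form <- html_form(page)[[1]]
filled <- html_form_set(form, memberStateCode = "AT", number = "4324")

# The value that would be sent for the "number" field
number_value <- filled$fields$number$value
```

The response returned by `session_submit()` can then be passed to `html_table()` in the same way as `set` above.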
