How can I scrape a CGI-Bin with rvest and R?

Question

I am trying to use rvest to scrape the results of a web form that are returned by a cgi-bin script. However, when I run the script, all I get back is "0 results within 200 miles". My code is below; I appreciate any feedback and help. The main website is http://www.zmax.com/, which has the search box that launches the cgi-bin.

library(rvest)
library(purrr)
library(plyr)
library(dplyr)

# read_html() issues a plain GET request to the lookup script
x <- read_html('http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl')

y <- x %>% html_node('table') %>% html_table(fill = TRUE)

I have also tried:

y <- x %>%
  html_node('td div td, p') %>%
  html_text()

I am unsure of where I am going wrong in returning the data that is on the form.

Recommended answer

Strangely enough, neither the main site nor the provider they use for outlet lookups prohibits scraping, either in their terms and conditions (T&C) or via the Robots Exclusion Protocol (REP, i.e. robots.txt). ¯\_(ツ)_/¯

You should really get familiar with your browser's Developer Tools: they would have shown you that the main site makes an HTTP POST request to the lookup site, rather than the GET request that browsers normally make and that read_html() makes. Here's what you need to do to get a successful request (we'll pick a zip code near-ish to you):

library(httr)
library(rvest)

POST(
  url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl", 
  body = list(zipcode = "48127"), 
  encode = "form"
) -> res

res is an httr response object and one would normally just do:

content(res, as="parsed")

to get a parsed object ready for XML/HTML dissection. But, there are weird encoding issues (at least for me) on that site forcing us to have to do:

content(res, as="raw") %>% read_html() -> pg
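Putting the request and the parsing step together, a minimal sketch of a reusable helper might look like the following. The function name find_outlets is hypothetical (it is not part of httr or rvest), and the sketch assumes the endpoint still accepts a form-encoded zipcode field exactly as in the POST call above.

```r
# Hypothetical helper (not part of httr or rvest): POST a zip code to the
# lookup script and return the parsed HTML document.
find_outlets <- function(zipcode) {
  res <- httr::POST(
    url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl",
    body = list(zipcode = as.character(zipcode)),  # sent as a form field
    encode = "form"
  )
  httr::stop_for_status(res)  # fail loudly on a 4xx/5xx response
  # Parse from the raw bytes to sidestep the encoding issues noted above
  xml2::read_html(httr::content(res, as = "raw"))
}

# Usage (requires network access, so it is left commented out):
# pg <- find_outlets("48127")
```

Calling the packages through their namespaces keeps the helper independent of which libraries happen to be attached.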

You should cat(as.character(pg)) to see how ugly the HTML is. It's nested tables, but not in a good way. The entries you see there are all <tr> elements with no <table> breaks. Thankfully? there are only singular <td> elements in each of those <tr> elements. So, we can grab them all in one fell swoop by targeting the correct <table>:

rows <- html_nodes(pg, "table[width='300'] > tr > td")
rows
## {xml_nodeset (60)}
##  [1] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
##  [2] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">6938 NORTH TELEGRAPH ROAD</font></td>
##  [3] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI  48127</font></td>
##  [4] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 792-9134</font></td>
##  [5] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=6938+NORTH+TELEGRAPH+R ...
##  [6] <td width="300" height="6"></td>
##  [7] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Advance Auto Parts</b></font></p ...
##  [8] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8120 North Telegraph Road</font></td>
##  [9] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI  48127</font></td>
## [10] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 528-4920</font></td>
## [11] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8120+North+Telegraph+R ...
## [12] <td width="300" height="6"></td>
## [13] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Pep Boys</b></font></p></td>
## [14] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8955 TELEGRAPH RD</font></td>
## [15] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Redford, MI  48239</font></td>
## [16] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 532-5750</font></td>
## [17] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8955+TELEGRAPH+RD+Redf ...
## [18] <td width="300" height="6"></td>
## [19] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
## [20] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">27207 PLYMOUTH ROAD</font></td>
## ...
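The "\u0092" in the store names above is a C1 control character. Byte 0x92 is the right single quotation mark in Windows-1252, so the page is probably Windows-1252 text that was decoded as something else. As a hedged sketch, assuming that diagnosis is right, the character can be repaired after the fact (the variable names here are illustrative only):

```r
# One store name as it appears in the scraped output
store <- "O\u0092REILLY AUTO PARTS"

# Assumption: U+0092 stands in for the Windows-1252 apostrophe (U+2019)
fixed <- gsub("\u0092", "\u2019", store, fixed = TRUE)
fixed
```

It may also be worth trying the encoding argument of read_html() when parsing the raw response, though that has not been verified against this site.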

There are many approaches one could take to make a data frame out of that mess. One simple one relies on the fact that the store titles have a background color set while the other cells do not. This makes the code a bit fragile, but we can make it less fragile by testing only for the presence of a background color rather than for a specific value. Why do we even need to do this? Well, we need to mark the start and end of each record, and one easy way to do that is to cumsum() a logical vector, knowing that FALSE == 0 and TRUE == 1. Why does that matter? We can create an implicit grouping column that way:

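library(dplyr)  # for tibble(), mutate() and filter()

# tibble() replaces the deprecated data_frame()
tibble(
  record = !is.na(html_attr(rows, "bgcolor")),  # TRUE on each store title cell
  text = html_text(rows, trim=TRUE)
) %>% 
  mutate(record = cumsum(record)) -> xdf
xdf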
## # A tibble: 60 x 2
##    record                        text
##     <int>                       <chr>
##  1      1  "O\u0092REILLY AUTO PARTS"
##  2      1   6938 NORTH TELEGRAPH ROAD
##  3      1 Dearborn Heights, MI  48127
##  4      1              (313) 792-9134
##  5      1                0 miles away
##  6      1                            
##  7      2          Advance Auto Parts
##  8      2   8120 North Telegraph Road
##  9      2 Dearborn Heights, MI  48127
## 10      2              (313) 528-4920
## # ... with 50 more rows
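To see the cumsum() trick in isolation, here is a small base R sketch. The is_header vector is made-up illustrative data that mimics the first ten cells above, where TRUE marks a cell with a background color:

```r
# TRUE marks the first cell of each store record (the one with a bgcolor)
is_header <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
               TRUE, FALSE, FALSE, FALSE)

# Each TRUE bumps the running total by one, so every cell inherits the
# number of the most recent header cell
record <- cumsum(is_header)
record
## [1] 1 1 1 1 1 1 2 2 2 2
```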

Now, we need to remove the empty rows with filter() and do some munging to get the data into a decent form for making a data frame. This is super fragile code in that this particular snippet can handle missing phone number data but that's about it. If there's a second address line, you'll need to modify this approach or use a different approach:

library(tidyr)  # for separate()

filter(xdf, text != "") %>% 
  group_by(record) %>% 
  summarise(x = paste0(text, collapse="|")) %>% 
  separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"), sep="\\|", extra="merge")
## # A tibble: 10 x 5
##    record                      store                  address1              city_state_zip       phone_and_or_distance
##  *  <int>                      <chr>                     <chr>                       <chr>                       <chr>
##  1      1 "O\u0092REILLY AUTO PARTS" 6938 NORTH TELEGRAPH ROAD Dearborn Heights, MI  48127 (313) 792-9134|0 miles away
##  2      2         Advance Auto Parts 8120 North Telegraph Road Dearborn Heights, MI  48127 (313) 528-4920|0 miles away
##  3      3                   Pep Boys         8955 TELEGRAPH RD          Redford, MI  48239 (313) 532-5750|2 miles away
##  4      4 "O\u0092REILLY AUTO PARTS"       27207 PLYMOUTH ROAD          Redford, MI  48239 (313) 937-1787|2 miles away
##  5      5 "O\u0092REILLY AUTO PARTS"      14975 TELEGRAPH ROAD          Redford, MI  48239 (313) 538-3584|2 miles away
##  6      6                   AutoZone           24250 FIVE MILE          Redford, MI  48239 (313) 527-6877|2 miles away
##  7      7 "O\u0092REILLY AUTO PARTS"        5940 MIDDLEBELT RD      Garden City, MI  48135 (734) 525-1607|3 miles away
##  8      8                   AutoZone        6228 MIDDLEBELT RD      Garden City, MI  48135 (734) 513-2233|3 miles away
##  9      9         Advance Auto Parts       3845 S Telegraph Rd         Dearborn, MI  48124 (313) 274-6549|3 miles away
## 10     10 "O\u0092REILLY AUTO PARTS"     27565 MICHIGAN AVENUE          Inkster, MI  48141 (313) 724-8544|3 miles away 
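If the positional approach turns out to be too brittle, one alternative is to classify each line by its pattern instead of its position. The following is only a sketch in base R: parse_record is a hypothetical helper, the regular expressions are assumptions based on the ten records shown above, and the "SUITE B" line is invented to illustrate a second address line.

```r
# Hypothetical helper: turn the text lines of one record into a one-row
# data frame, recognising fields by pattern rather than by position.
parse_record <- function(lines) {
  store <- lines[1]   # the title cell always comes first
  rest <- lines[-1]
  is_phone <- grepl("^\\([0-9]{3}\\) [0-9]{3}-[0-9]{4}$", rest)
  is_dist <- grepl("miles? away$", rest)
  is_csz <- grepl(", [A-Z]{2} +[0-9]{5}$", rest)
  first_or_na <- function(keep) if (any(keep)) rest[keep][1] else NA_character_
  data.frame(
    store = store,
    # anything unrecognised is treated as an address line
    address = paste(rest[!(is_phone | is_dist | is_csz)], collapse = ", "),
    city_state_zip = first_or_na(is_csz),
    phone = first_or_na(is_phone),
    distance = first_or_na(is_dist),
    stringsAsFactors = FALSE
  )
}

# One record copied from the output above, and one invented record with a
# second address line and no phone number
rec1 <- c("Pep Boys", "8955 TELEGRAPH RD", "Redford, MI  48239",
          "(313) 532-5750", "2 miles away")
rec2 <- c("AutoZone", "24250 FIVE MILE", "SUITE B",
          "Redford, MI  48239", "2 miles away")

stores <- do.call(rbind, lapply(list(rec1, rec2), parse_record))
stores
```

With the real data, the list of records could be built by first dropping the empty rows (nonempty <- xdf[xdf$text != "", ]) and then calling split(nonempty$text, nonempty$record).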

Just in case the process was non-obvious, we:

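  • group the rows by our newly created record column
  • collapse all of the text into one string, with each component separated by |
  • separate out all the individual bits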

That should hopefully help explain the fragility.

Granted, you only wanted the "how to get to the content" part, but hopefully this saved you some more time.
