How can I scrape a CGI-Bin with rvest and R?

Published: 2021/7/14 18:39:46. Tags: r, web-scraping, rvest, httr


Question

I am trying to use rvest to scrape the results of a web form that are served by a cgi-bin script. However, when I run the script, the page I get back reports 0 results within 200 miles. My code is below; I appreciate any feedback and help. The main website is http://www.zmax.com/, which has the search box that launches the cgi-bin script.

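library(rvest)
library(purrr)
library(plyr)
library(dplyr)

# Plain GET request to the lookup script (no form data is sent)
x <- read_html('http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl')

# Take the first table on the page and convert it to a data frame
y <- x %>% html_node('table') %>% html_table(fill = TRUE)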

I also tried:

y <- x %>%
  html_node('td div td, p') %>%
  html_text()

I am unsure where I am going wrong in retrieving the data shown on the form.

Recommended answer

Strangely enough, neither the main site nor the provider they use for outlet lookups prohibits scraping, either in their terms and conditions (T&C) or through the Robots Exclusion Protocol (REP, i.e. robots.txt). ¯\_(ツ)_/¯

You should really get familiar with your browser's Developer Tools: they would have shown you that the main site makes an HTTP POST request to the lookup site, rather than the GET request that browsers normally make and that read_html() makes. Here's what you need to do to get a successful request (we'll pick a zip code near-ish to you):

library(httr)
library(rvest)

POST(
  url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl", 
  body = list(zipcode = "48127"), 
  encode = "form"
) -> res

res is an httr response object, and one would normally just do:

content(res, as="parsed")

to get a parsed object ready for XML/HTML dissection. But there are weird encoding issues on that site (at least for me), which force us to read the raw bytes and parse them ourselves:

content(res, as="raw") %>% read_html() -> pg
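
The request and parsing steps so far could be wrapped in a small helper that also checks the HTTP status before parsing. This is only a sketch: the function name fetch_outlet_page is hypothetical and not part of the original answer, and it assumes the endpoint still accepts a form-encoded zipcode field as shown above.

```r
library(httr)
library(rvest)

# Hypothetical helper (an assumption, not from the original answer):
# POST a zip code to the lookup script and return the parsed HTML document.
fetch_outlet_page <- function(zipcode,
                              url = "http://www.nearestoutlet.com/cgi-bin/smi/findsmi.pl") {
  res <- POST(url = url, body = list(zipcode = zipcode), encode = "form")
  stop_for_status(res)                 # stop with an error on 4xx/5xx responses
  read_html(content(res, as = "raw"))  # parse raw bytes, as in the step above
}

# Example call (needs network access):
# pg <- fetch_outlet_page("48127")
```

Checking the status first means a failed request surfaces as an error instead of as an empty node set further down.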

You should cat(as.character(pg)) to see how ugly the HTML is. It's nested tables, but not in a good way. The entries you see there are all <tr> elements with no <table> breaks between them. Thankfully, there is only a single <td> element in each of those <tr> elements, so we can grab them all in one fell swoop by targeting the correct <table>:

rows <- html_nodes(pg, "table[width='300'] > tr > td")
rows
## {xml_nodeset (60)}
##  [1] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
##  [2] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">6938 NORTH TELEGRAPH ROAD</font></td>
##  [3] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI  48127</font></td>
##  [4] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 792-9134</font></td>
##  [5] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=6938+NORTH+TELEGRAPH+R ...
##  [6] <td width="300" height="6"></td>
##  [7] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Advance Auto Parts</b></font></p ...
##  [8] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8120 North Telegraph Road</font></td>
##  [9] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Dearborn Heights, MI  48127</font></td>
## [10] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 528-4920</font></td>
## [11] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8120+North+Telegraph+R ...
## [12] <td width="300" height="6"></td>
## [13] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>Pep Boys</b></font></p></td>
## [14] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">8955 TELEGRAPH RD</font></td>
## [15] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">Redford, MI  48239</font></td>
## [16] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">(313) 532-5750</font></td>
## [17] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px"><a href="#" onclick="window.open('http://maps.google.com/maps?q=8955+TELEGRAPH+RD+Redf ...
## [18] <td width="300" height="6"></td>
## [19] <td width="300" height="19" bgcolor="#8B0101"><p align="left"><font face="Tahoma" color="#FFFFFF" style="font-size: 11px"><b>O\u0092REILLY AUTO PARTS</b></fo ...
## [20] <td width="300" height="2"><font face="Tahoma" style="font-size: 11px">27207 PLYMOUTH ROAD</font></td>
## ...
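
The \u0092 in the store names above is probably a Windows-1252 apostrophe (byte 0x92) that was decoded as Latin-1; that is an assumption about the site's encoding, not something the answer states. If it holds, a minimal cleanup could look like this (fix_apostrophe is a hypothetical helper name):

```r
# Hypothetical helper: replace the C1 control character U+0092 with a
# typographic apostrophe (U+2019). Assumes U+0092 only ever stands for
# an apostrophe in this data.
fix_apostrophe <- function(x) gsub("\u0092", "\u2019", x, fixed = TRUE)

fix_apostrophe("O\u0092REILLY AUTO PARTS")
```

It can be applied to the extracted text column once the rows have been converted to character strings.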
There are many ways to make a data frame out of that mess. One simple approach relies on the fact that the store-title cells have a background color set while the other cells do not. This makes the code a bit fragile, but we can make it less fragile by testing only for the presence of a background color rather than for a specific value. Why do we even need to do this? We need to mark where each record starts and ends, and an easy way to do that is to cumsum() a logical vector, since TRUE counts as 1 and FALSE as 0. That gives us an implicit grouping column:
library(dplyr)  # tibble(), mutate() and the pipe

tibble(
  record = !is.na(html_attr(rows, "bgcolor")),  # TRUE on store-title cells
  text = html_text(rows, trim = TRUE)
) %>%
  mutate(record = cumsum(record)) -> xdf  # running count of titles = record id

xdf
## # A tibble: 60 x 2
##    record                        text
##     <int>                       <chr>
##  1      1  "O\u0092REILLY AUTO PARTS"
##  2      1   6938 NORTH TELEGRAPH ROAD
##  3      1 Dearborn Heights, MI  48127
##  4      1              (313) 792-9134
##  5      1                0 miles away
##  6      1                            
##  7      2          Advance Auto Parts
##  8      2   8120 North Telegraph Road
##  9      2 Dearborn Heights, MI  48127
## 10      2              (313) 528-4920
## # ... with 50 more rows
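
The cumsum() trick used above can be seen in isolation on a toy vector. The values below are made up for illustration and are not taken from the scraped page:

```r
# TRUE marks the first row of each record (the store-title rows in the answer).
is_title <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# cumsum() treats TRUE as 1 and FALSE as 0, so the running total goes up
# by one at every title row and stays flat in between.
record <- cumsum(is_title)
record
## [1] 1 1 1 1 2 2 2 3 3

# Rows that share a record id can then be split or grouped together.
text_rows <- c("Store A", "1 Main St", "Town, MI 48000", "(555) 010-0000",
               "Store B", "2 Main St", "Town, MI 48000",
               "Store C", "3 Main St")
split(text_rows, record)
```

Because the id only depends on where the TRUE values fall, records with different numbers of rows are still grouped correctly.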
Now we need to remove the empty rows with filter() and do some munging to get the data into a decent shape for a data frame. This is super fragile code: this particular snippet can cope with a missing phone number, but that's about it. If there's a second address line, you'll need to modify this approach or use a different one:
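library(tidyr)  # separate(); filter(), group_by(), summarise() come from dplyr

filter(xdf, text != "") %>%                       # drop the empty spacer rows
  group_by(record) %>%                            # one group per store
  summarise(x = paste0(text, collapse = "|")) %>% # one "|"-delimited string per store
  separate(x, c("store", "address1", "city_state_zip", "phone_and_or_distance"), sep = "\\|", extra = "merge")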
## # A tibble: 10 x 5
##    record                      store                  address1              city_state_zip       phone_and_or_distance
##  *  <int>                      <chr>                     <chr>                       <chr>                       <chr>
##  1      1 "O\u0092REILLY AUTO PARTS" 6938 NORTH TELEGRAPH ROAD Dearborn Heights, MI  48127 (313) 792-9134|0 miles away
##  2      2         Advance Auto Parts 8120 North Telegraph Road Dearborn Heights, MI  48127 (313) 528-4920|0 miles away
##  3      3                   Pep Boys         8955 TELEGRAPH RD          Redford, MI  48239 (313) 532-5750|2 miles away
##  4      4 "O\u0092REILLY AUTO PARTS"       27207 PLYMOUTH ROAD          Redford, MI  48239 (313) 937-1787|2 miles away
##  5      5 "O\u0092REILLY AUTO PARTS"      14975 TELEGRAPH ROAD          Redford, MI  48239 (313) 538-3584|2 miles away
##  6      6                   AutoZone           24250 FIVE MILE          Redford, MI  48239 (313) 527-6877|2 miles away
##  7      7 "O\u0092REILLY AUTO PARTS"        5940 MIDDLEBELT RD      Garden City, MI  48135 (734) 525-1607|3 miles away
##  8      8                   AutoZone        6228 MIDDLEBELT RD      Garden City, MI  48135 (734) 513-2233|3 miles away
##  9      9         Advance Auto Parts       3845 S Telegraph Rd         Dearborn, MI  48124 (313) 274-6549|3 miles away
## 10     10 "O\u0092REILLY AUTO PARTS"     27565 MICHIGAN AVENUE          Inkster, MI  48141 (313) 724-8544|3 miles away 
Just in case the process was non-obvious, we:
- group the rows by our newly created record column
- collapse all the text into a single string, with each section separated by |
- separate out all the individual bits
That should hopefully help explain the fragility.
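
One possible way to reduce part of that fragility is to split the merged phone_and_or_distance column afterwards. The sketch below uses base R and assumes the distance is always the last piece and that a phone number, when present, comes before it; the second input value is a made-up example of a record without a phone number.

```r
# Values shaped like the phone_and_or_distance column. The first comes from
# the output above; the second is invented to imitate a missing phone number.
x <- c("(313) 792-9134|0 miles away", "2 miles away")

parts <- strsplit(x, "|", fixed = TRUE)

# Assumption: the distance is always the last piece.
distance <- vapply(parts, function(p) p[length(p)], character(1))

# Assumption: a phone number is present only when there are two pieces.
phone <- vapply(parts,
                function(p) if (length(p) > 1) p[1] else NA_character_,
                character(1))

# Strip the trailing " mile(s) away" and convert to a number.
miles <- as.integer(sub(" miles? away$", "", distance))

data.frame(phone, miles, stringsAsFactors = FALSE)
```

A second address line would still break the earlier separate() call, so this only addresses the last column.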
Granted, you only wanted the "how to get to the content" part, but hopefully this saved you some more time.