Scraping website in R


Problem description

I am trying to scrape the name, city, state, email, etc. of professionals from this website, http://www.napo.net/search/newsearch.asp, using rvest, but I can't seem to get the CSS selectors using SelectorGadget, and the e-mails are protected with JavaScript.

I have checked the forums and haven't seen any issue like this.

Recommended answer

This solution uses the seleniumPipes and RSelenium packages. You should also download phantomjs, unzip it, and put the .exe file in your R working directory.
This method uses a headless browser (phantomjs) that simulates user behavior and can read JavaScript-generated values.
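Note that the code below actually drives Chrome through rsDriver rather than phantomjs. If you would rather use phantomjs as just described, RSelenium can launch it the same way; a minimal sketch, assuming the phantomjs binary is in your working directory or on your PATH:

library(RSelenium)
library(seleniumPipes)
#start a phantomjs-backed selenium server instead of chrome
rD <- rsDriver(browser = "phantomjs", port = 4444L)
#connect a seleniumPipes session to that server
remDr <- remoteDr(browserName = "phantomjs", port = 4444L)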

library(rvest)
library(RSelenium)
library(seleniumPipes)
#start a selenium server with the utility function
rD <- rsDriver(browser = "chrome", chromever = "latest", port = 4444L)
#open browser
remDr <- remoteDr(browserName = "chrome")

main_page_url <- "http://www.napo.net/search/newsearch.asp"
#go to home page
remDr %>% go(main_page_url)
#switch to iframe
remDr %>% switchToFrame(Id = "SearchResultsFrame")
#get all relative paths
relative_path <- remDr %>% getPageSource() %>% html_nodes(".lineitem a[href]") %>% html_attr("href")
#all individual urls:
full_paths <- paste0("http://www.napo.net",relative_path)
#scrape email from each page
email_address <- list()
#Retrieve email addresses from the first three results
for(i in seq_along(full_paths[1:3])){
    remDr %>% go(full_paths[i])
    emails <- remDr %>% getPageSource() %>% html_nodes('a[href^="mailto"]') %>% html_text()
    email_address <- c(email_address, list(email = emails))
    Sys.sleep(3)
}
#display the first result
email_address[1]
# $email
# [1] "marla@123organize.com"

The above is all for page one. If you want to turn to page two:

remDr %>% go(main_page_url)
remDr %>% switchToFrame(Id = "SearchResultsFrame")
#click page two in the iframe pager to turn to page 2:
remDr %>% findElement(using = "css selector",value = ".DotNetPager a:nth-child(2)") %>% elementClick()
#get relative and full path again
relative_path <- remDr %>% getPageSource() %>% html_nodes(".lineitem a[href]") %>% html_attr("href")
full_paths <- paste0("http://www.napo.net",relative_path)
#And you can do the for loop again
for(i in seq_along(full_paths[1:3])){
    remDr %>% go(full_paths[i])
    emails <- remDr %>% getPageSource() %>% html_nodes('a[href^="mailto"]') %>% html_text()
    email_address <- c(email_address, list(email = emails))
    Sys.sleep(3)
}
#display the sixth result
email_address[6]
# $email
# [1] "lynette@itssimplyplaced.com"

email_address
#You can also do a loop to scrape all pages
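#A minimal sketch of that loop, assuming the pager links to every page
#directly and that n_pages is read off the .DotNetPager element by hand
#(both are assumptions, not something verified in the original answer):
n_pages <- 5  #assumption: set after inspecting the pager
all_emails <- list()
for (p in seq_len(n_pages)) {
    remDr %>% go(main_page_url)
    remDr %>% switchToFrame(Id = "SearchResultsFrame")
    if (p > 1) {
        #assumption: the link for page p is the p-th child of the pager
        remDr %>% findElement(using = "css selector",
                              value = paste0(".DotNetPager a:nth-child(", p, ")")) %>% elementClick()
    }
    relative_path <- remDr %>% getPageSource() %>% html_nodes(".lineitem a[href]") %>% html_attr("href")
    full_paths <- paste0("http://www.napo.net", relative_path)
    for (url in full_paths) {
        remDr %>% go(url)
        emails <- remDr %>% getPageSource() %>% html_nodes('a[href^="mailto"]') %>% html_text()
        all_emails <- c(all_emails, list(email = emails))
        Sys.sleep(3)
    }
}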
#-----
#delete session and close server
remDr %>% deleteSession()
rD[["server"]]$stop() 
