Scraping a website that includes JS/jQuery code with R
Question
I want to extract the hyperlinks from this website for different searches (don't be scared that it is in Danish). The hyperlinks can be found to the right (v15, v14, v13, etc.) [example]. As far as I can tell, the website builds its search results with some kind of jQuery/JavaScript. This is based on my very limited knowledge of HTML and might be wrong.
I think this is why the following code does not work (I use the "rvest" package):
library(rvest)
sdslink <- "http://karakterstatistik.stads.ku.dk/#searchText=&term=&block=&institute=null&faculty=&searchingCourses=true&page=1"
# Read the page and pull the href of every link inside the search results
s_link <- sdslink %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes("#searchResults a") %>%
  html_attr("href")
I have found a method that works, but it requires me to download each page manually with right click + "Save as". This is not feasible, however, as I want to scrape a total of 100 pages for hyperlinks.
I have tried to use the jsonlite package combined with httr, but I cannot seem to find the right .json file.
I hope you have a solution: either getting jsonlite to work, automating the "Save as" approach, or a third, more clever path.
Recommended answer
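A likely reason the rvest attempt comes back empty: everything after the `#` in the URL is a fragment, which an HTTP client does not send to the server. The search parameters therefore only take effect once JavaScript runs in a browser. The base R sketch below just splits the URL from the question to show which part is actually requested; the variable names `request_url` and `fragment` are illustrative.

```r
# URL taken from the question
sdslink <- "http://karakterstatistik.stads.ku.dk/#searchText=&term=&block=&institute=null&faculty=&searchingCourses=true&page=1"

# Split at the first "#": the left part is what an HTTP client requests,
# the right part (the fragment) stays on the client side
request_url <- sub("#.*$", "", sdslink)
fragment    <- sub("^[^#]*#", "", sdslink)

print(request_url)
print(fragment)
```

Since the request carries no search parameters, a tool that drives a real browser is a reasonable way to get the rendered results.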
One approach is to use RSelenium. Here's some simple code to get you started. I assume you already have RSelenium and a webdriver installed. Navigate to your site of interest:
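library(RSelenium)
# Start a local Selenium server. Note: startServer() is defunct in newer
# RSelenium releases; there, rsDriver() or a Docker-hosted server takes its place.
startServer()
# Connect to the server and open a Chrome session
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444,
                      browserName = "chrome")
remDr$open(silent = TRUE)
remDr$navigate("http://karakterstatistik.stads.ku.dk/")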
Find the submit button by inspecting the source:
webElem <- remDr$findElement("name", "submit")
webElem$clickElement()
Save the first 5 pages:
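html_source <- vector("list", 5)
for (i in seq_along(html_source)) {
  # getPageSource() returns a list; keep the HTML string in its first element
  html_source[[i]] <- remDr$getPageSource()[[1]]
  # Move on to the next results page and give it time to load
  webElem <- remDr$findElement("id", "next")
  webElem$clickElement()
  Sys.sleep(2)
}
remDr$close()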
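To get from the saved page sources to the hyperlinks the question asks for, the rvest pipeline from the question can be applied to each saved page. The following is a minimal sketch, not a tested scraper for this site: `extract_links` is a hypothetical helper, the `#searchResults a` selector is taken from the question and assumed to match the rendered page, and `demo_page` is made-up HTML standing in for a real page source.

```r
library(rvest)

# Hypothetical helper: collect the hrefs of all links inside #searchResults
# from a list of rendered page sources
extract_links <- function(pages) {
  links <- lapply(pages, function(page) {
    # Accept either the raw list returned by getPageSource() or a plain string
    html <- if (is.list(page)) page[[1]] else page
    read_html(html) %>%
      html_nodes("#searchResults a") %>%
      html_attr("href")
  })
  unique(unlist(links))
}

# Made-up stand-in for one saved page source (not the real site's markup)
demo_page <- '<html><body><div id="searchResults"><a href="/course/1/v15">v15</a><a href="/course/1/v14">v14</a></div><a href="/about">About</a></body></html>'

demo_links <- extract_links(list(list(demo_page)))
print(demo_links)
```

With the loop above, the call would be `extract_links(html_source)`. Links outside `#searchResults`, such as the `/about` link in the stand-in page, are ignored.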