How to loop over each class link and extract the capacity (seats) attribute in R


Question

I want to extract the capacity (seats) attribute for each class listed at this link: https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec

If the posted link doesn't work, do the following:

In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched` 
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ARCH Architecture -> scroll down and click Class Search

For example, the subject ARCH lists many classes (the screenshots in the original post showed only a few of them). If you click a class, you will see the capacity attribute, which shows the number of seats.

I would like the output to look like this:

classes                                                          capacity - seats
Fundamentals of Design Studio - 23839 - ARCH 1111 - 002             15
Design Visualization - 11107 - ARCH 1113 - 001                      15
Building Technology 2 - 23840 - ARCH 2412 - 001                     20

How can I write a loop in R to get the capacity (seats) attribute for each class of each subject?

P.S. This question is a continuation of my previous post: https://stackoverflow.com/questions/64515601/problem-with-web-scraping-of-required-content-from-a-url-link-in-r

Recommended answer

This solution is very similar to the previous one.
It is more straightforward, since the link to the class size is in the same node as the class title. Depending on what information you want, the class size table will need to be cleaned up before merging with the remaining data.
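As a rough sketch of that cleanup step, the snippet below reduces a scraped seating table to the two columns the question asks for. The data frame `raw` is a hypothetical stand-in for the scraped result. Its column names (`Var.2`, `Capacity`, `Actual`, `Remaining`) are assumptions about what `html_table()` returns for the seating table, and the `Actual` and `Remaining` values are placeholders, so check `names()` on the real data first.

```r
# Hypothetical stand-in for the scraped result. The column names are
# assumptions about the seating table; Actual and Remaining are placeholders.
raw <- data.frame(
  class     = c("Fundamentals of Design Studio - 23839 - ARCH 1111 - 002",
                "Design Visualization - 11107 - ARCH 1113 - 001"),
  Var.2     = c("Seats", "Seats"),
  Capacity  = c(15L, 15L),
  Actual    = c(0L, 0L),
  Remaining = c(15L, 15L),
  stringsAsFactors = FALSE
)

# Keep only the "Seats" row of each class and the two columns of interest.
seats <- raw[raw$Var.2 == "Seats", c("class", "Capacity")]
names(seats) <- c("classes", "capacity_seats")
rownames(seats) <- NULL
print(seats)
```

If the real table carries extra rows (for example waitlist seats), the row filter is the place to adjust.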

Also, since you will be querying multiple pages on the site, please add a short pause between requests to be polite and to avoid looking like an attack.
Note that there is no error checking to ensure the correct table is available; I suggest adding some before using this as production code.
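One possible way to add both the pause and the error check is a small wrapper built on base R's `tryCatch()`. This is only a sketch: `safe_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real page-scraping step so the example runs offline.

```r
# Wrap a per-class scraping step so that one failed page does not stop the loop.
# `fetch_fun` is any function that takes a link and returns a data frame.
safe_fetch <- function(link, fetch_fun, pause = 0.5) {
  Sys.sleep(pause)  # wait between requests to be polite
  tryCatch(
    fetch_fun(link),
    error = function(e) {
      message("Skipping ", link, ": ", conditionMessage(e))
      NULL  # NULL marks a page that could not be processed
    }
  )
}

# Offline demonstration with a fake fetcher that fails for one link.
fake_fetch <- function(link) {
  if (grepl("bad", link)) stop("seating table not found")
  data.frame(link = link, Capacity = 15L, stringsAsFactors = FALSE)
}

results <- lapply(c("/good1", "/bad", "/good2"), safe_fetch,
                  fetch_fun = fake_fetch, pause = 0)
results <- Filter(Negate(is.null), results)  # drop the failed pages
combined <- do.call(rbind, results)
print(combined)
```

In the real scraper, the function passed as `fetch_fun` would read the class page and call `stop()` when the seating table is missing.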

#https://stackoverflow.com/questions/64515601/problem-with-web-scraping-of-required-content-from-a-url-link-in-r/64517844#64517844
library(rvest)
library(dplyr)

# In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched` 
# Select by term -> Spring Term 2021 (view only) -> Submit
# Subject -> select ARCH Architecture -> scroll down and click Class Search

url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
              sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
              sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
              sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
              sel_crse = "",      sel_title = "",     sel_insm = "%",
              sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
              sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
              sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
              begin_ap = "a",     end_hh = "0",       end_mi = "0",
              end_ap = "a")

html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes("th.ddtitle") 

dfs <- lapply(classes, function(class) {
   # get the class name
   classname <- class %>% html_text()
   print(classname)
   # pause so the requests do not look like a denial-of-service attack
   Sys.sleep(0.5)
   # the link to the class detail page is in the same node as the title
   classlink <- class %>% html_node("a") %>% html_attr("href")
   fulllink <- paste0("https://ssb.bannerprod.memphis.edu", classlink)

   newpage <- read_html(fulllink)
   # find the tables
   tables <- newpage %>% html_nodes("table.datadisplaytable")
   # find the index of the seating table
   seatingtable <- which(html_attr(tables, "summary") == "This layout table is used to present the seating numbers.")
   size <- tables[seatingtable] %>% html_table(header = TRUE)
   # may want to clean up the table before combining into a data frame,
   # e.g. size[[1]][1, -1]
   data.frame(class = classname, size[[1]], link = fulllink)
})

answer <- bind_rows(dfs)
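
The code above handles a single subject (ARCH). To cover each subject, as the question asks, one option is to generate the form fields per subject. The sketch below copies the field names and values from the query above and varies only the second `sel_subj` entry. `build_query` is a hypothetical helper, and `"ART"` is a placeholder subject code that has not been checked against the site.

```r
# Build the form fields for one subject. Field names and values are copied
# from the query above; only the real `sel_subj` value changes.
build_query <- function(subject, term = "202110") {
  list(term_in = term,     sel_subj = "dummy", sel_day = "dummy",
       sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
       sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
       sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = subject,
       sel_crse = "",      sel_title = "",     sel_insm = "%",
       sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
       sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
       sel_attr = "%",     begin_hh = "0",     begin_mi = "0",
       begin_ap = "a",     end_hh = "0",       end_mi = "0",
       end_ap = "a")
}

subjects <- c("ARCH", "ART")  # "ART" is a placeholder subject code
queries <- lapply(subjects, build_query)
names(queries) <- subjects

# Show the two `sel_subj` fields generated for each subject.
for (s in subjects) {
  q <- queries[[s]]
  print(unlist(q[names(q) == "sel_subj"], use.names = FALSE))
}
```

Each element of `queries` could then be posted with `httr::POST(url, body = ...)` as in the code above, running the class loop on each result page and combining everything with `bind_rows()`.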
