Problem with web scraping of required content from a URL link in R
Problem description
I am using a script to scrape the required content from a link that lists classes for different subjects.
library(rvest)
url <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
sel_crse = "", sel_title = "", sel_insm = "%",
sel_from_cred = "", sel_to_cred = "", sel_camp = "%",
sel_levl = "%", sel_ptrm = "%", sel_instr = "%",
sel_attr = "%", begin_hh = "0", begin_mi = "0",
begin_ap = "a", end_hh = "0", end_mi = "0",
end_ap = "a")
In the above query, sel_subj changes for each subject.
html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes(xpath = "//th/a") %>% html_text()
instructor_nodes <- html %>%
html_nodes(xpath = "//td[@class='dddefault']/a[contains(@href, 'mailto')]")
instructors <- html_attr(instructor_nodes, "target")
emails <- html_attr(instructor_nodes, "href")
length(classes)
[1] 32
length(instructors)
[1] 39
length(emails)
[1] 39
sq <- seq(max(length(classes), length(instructors), length(emails)))
data.frame(classes[sq], instructors[sq], emails[sq])
The result looks like the following, which is wrong:
classes.sq. instructors.sq. emails.sq.
1 Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2 Fundamentals of Design Studio - 23839 - ARCH 1111 - 002 Pamela J. Hurley mailto:pjhurley@memphis.edu
3 Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore mailto:mkchsmre@memphis.edu
4 Design Visualization - 18386 - ARCH 1113 - 002 Michael K. Chisamore mailto:mkchsmre@memphis.edu
5 History of Architecture 1 - 23218 - ARCH 1211 - 001 Pamela J. Hurley mailto:pjhurley@memphis.edu
6 Building Technology 2 - 23840 - ARCH 2412 - 001 Marika E. Snider mailto:mesnider@memphis.edu
7 Computer Apps in Design 2 - 11111 - ARCH 2612 - 001 Timothy E. Michael mailto:tmichael@memphis.edu
8 Design Studio 2 - 11112 - ARCH 2712 - 001 Timothy E. Michael mailto:tmichael@memphis.edu
9 Design Studio 2 - 15408 - ARCH 2712 - 002 Andrew M. Parks mailto:amparks@memphis.edu
10 Survey of Interiors+Furniture - 25734 - ARCH 3213 - 001 Andrew M. Parks mailto:amparks@memphis.edu
11 Determinants of Modern Design - 27436 - ARCH 3221 - 001 Michael D. Hagge mailto:mdhagge@memphis.edu
12 Structural Design 2 - 23837 - ARCH 3322 - 001 Michael D. Hagge mailto:mdhagge@memphis.edu
13 Professional Practice - 25097 - ARCH 3431 - 001 Andrew M. Parks mailto:amparks@memphis.edu
14 Design Studio 4 - 11115 - ARCH 3714 - 001 Sonia Raheel mailto:sraheel@memphis.edu
15 Design Studio 4 - 23221 - ARCH 3714 - 002 Pamela J. Hurley mailto:pjhurley@memphis.edu
16 Architecture Independent Study - 11117 - ARCH 4021 - 201 Jennifer L. Barker mailto:jlbrker1@memphis.edu
17 Sustainable Design - 19491 - ARCH 4421 - 001 Jennifer L. Barker mailto:jlbrker1@memphis.edu
18 Internship in Architecture - 21000 - ARCH 4430 - 001 Marika E. Snider mailto:mesnider@memphis.edu
19 Design Studio 6 - 11134 - ARCH 4716 - 001 Pamela J. Hurley mailto:pjhurley@memphis.edu
20 Sustainable Design - 19492 - ARCH 6421 - 001 Marika E. Snider mailto:mesnider@memphis.edu
21 Advanced Design Seminar 2 - 18387 - ARCH 7012 - 001 Marika E. Snider mailto:mesnider@memphis.edu
22 Contemporary Architecture 2 - 24104 - ARCH 7222 - 001 Pamela J. Hurley mailto:pjhurley@memphis.edu
23 Internship in Architecture - 19495 - ARCH 7430 - 001 Jennifer L. Barker mailto:jlbrker1@memphis.edu
24 Adv Professional Practice - 19496 - ARCH 7431 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
25 Advanced Design Studio 2 - 18389 - ARCH 7712 - 001 Michael D. Hagge mailto:mdhagge@memphis.edu
26 Architecture Research - 25098 - ARCH 7930 - 001 Brian D. Andrews mailto:bdndrews@memphis.edu
27 Architecture Thesis Studio - 19499 - ARCH 7996 - 003 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
28 Architecture Thesis Studio - 19500 - ARCH 7996 - 004 Brian D. Andrews mailto:bdndrews@memphis.edu
29 Architecture Thesis Studio - 19501 - ARCH 7996 - 005 Andrew M. Parks mailto:amparks@memphis.edu
30 Architecture Thesis Studio - 19502 - ARCH 7996 - 006 Michael D. Hagge mailto:mdhagge@memphis.edu
31 Architecture Thesis Studio - 19503 - ARCH 7996 - 007 Brian D. Andrews mailto:bdndrews@memphis.edu
32 Architecture Thesis Studio - 20972 - ARCH 7996 - 008 Michael K. Chisamore mailto:mkchsmre@memphis.edu
33 <NA> Pamela J. Hurley mailto:pjhurley@memphis.edu
34 <NA> Jennifer L. Barker mailto:jlbrker1@memphis.edu
35 <NA> Michael K. Chisamore mailto:mkchsmre@memphis.edu
36 <NA> Pamela J. Hurley mailto:pjhurley@memphis.edu
37 <NA> Jennifer L. Thompson mailto:jlthmps5@memphis.edu
38 <NA> Brian D. Andrews mailto:bdndrews@memphis.edu
39 <NA> Marika E. Snider mailto:mesnider@memphis.edu
But on the page itself, the data looks different. For example:
A few classes have no instructor or email (they are listed as TBA).
A few other classes have two, three, four, or more instructors.
And a few other classes list the same instructor multiple times.
For such data, I want my output to look like this:
classes.sq. instructors.sq. emails.sq.
1 Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2 Fundamentals of Design Studio - 23839 - ARCH 1111 - 002 TBA
3 Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore,Pamela J. Hurley mailto:mkchsmre@memphis.edu,pjhurley@memphis.edu
4 Design Visualization - 18386 - ARCH 1113 - 002 Pamela J. Hurley,Michael K. Chisamore mailto:pjhurley@memphis.edu,mkchsmre@memphis.edu
5 History of Architecture 1 - 23218 - ARCH 1211 - 001 Marika E. Snider mailto:mesnider@memphis.edu
6 Building Technology 2 - 23840 - ARCH 2412 - 001 Timothy E. Michael mailto:tmichael@memphis.edu
P.S. If the posted URL doesn't work, please follow these steps:
In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched`
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ARCH Architecture -> scroll down and click Class Search
How can I deal with missing data (TBA), multiple instructors, and the same instructor listed multiple times?
Recommended answer
The problem is with how `html_nodes()` is used. When it is applied to the whole document, it returns every matching node in one flat list, without any regard to which class each match was found under. Since your webpage will sometimes have multiple instructors per class, or none, a more targeted approach is needed.
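As a small illustration of the difference, here is a sketch on made-up HTML (the `cls` class name, the names, and the addresses are invented for this example and are not taken from the real page). It compares a document-wide search with a search inside each class cell:

```r
library(rvest)

# Toy HTML, invented for illustration: three classes with 1, 0, and 2 instructors.
toy <- read_html('
<table>
  <tr><td class="cls"><a href="mailto:a@example.edu" target="Ann A.">E-mail</a></td></tr>
  <tr><td class="cls">TBA</td></tr>
  <tr><td class="cls"><a href="mailto:b@example.edu" target="Bob B.">E-mail</a>
                      <a href="mailto:c@example.edu" target="Cy C.">E-mail</a></td></tr>
</table>')

# Document-wide search: one flat vector, with no record of the owning class.
flat <- toy %>%
  html_nodes(xpath = "//a[contains(@href, 'mailto')]") %>%
  html_attr("target")

# Per-class search: look for links inside each class cell separately.
cells <- toy %>% html_nodes("td.cls")
per_class <- lapply(seq_along(cells), function(i) {
  found <- cells[[i]] %>%
    html_nodes(xpath = ".//a[contains(@href, 'mailto')]") %>%
    html_attr("target")
  if (length(found) == 0) "TBA" else found   # placeholder when no link is present
})
```

The flat vector has three names for three classes only by coincidence; `per_class` keeps one entry per class, so the empty class and the two-instructor class stay aligned with their rows.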
In this code block, we first find each of the class nodes, which contain all of the information we want. Then we parse each of those nodes individually (inside the `lapply` function) to extract the instructors and emails, while also checking for empty fields. Each data frame has one row per instructor, so a data frame will have multiple rows if its class has multiple instructors.
We then combine the list of per-class data frames into one (`bind_rows`) and merge the instructor and email results that belong to the same class.
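The combine-then-merge step can be sketched on hypothetical per-class data frames (the class names and addresses below are invented). This sketch also adds a `distinct()` call, which is not part of the answer's code, as one possible way to handle an instructor who is listed more than once for the same class:

```r
library(dplyr)

# Hypothetical per-class data frames, shaped like the ones built per node.
dfs <- list(
  data.frame(classname = "Class A", instructors = "Ann A.",
             emails = "mailto:a@example.edu"),
  data.frame(classname = "Class B", instructors = "TBA", emails = ""),
  data.frame(classname = "Class C",
             instructors = c("Bob B.", "Cy C.", "Bob B."),
             emails = c("mailto:b@example.edu", "mailto:c@example.edu",
                        "mailto:b@example.edu"))
)

combined <- bind_rows(dfs) %>%
  distinct() %>%                      # drop rows repeated within a class
  group_by(classname) %>%
  summarise(instructors = paste(instructors, collapse = ", "),
            emails = paste(emails, collapse = ", "),
            .groups = "drop")         # one row per class
```

Note that `group_by()` followed by `summarise()` returns the classes sorted by name, which may differ from the order on the page.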
library(rvest)
library(dplyr)
url <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"
query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
sel_crse = "", sel_title = "", sel_insm = "%",
sel_from_cred = "", sel_to_cred = "", sel_camp = "%",
sel_levl = "%", sel_ptrm = "%", sel_instr = "%",
sel_attr = "%", begin_hh = "0", begin_mi = "0",
begin_ap = "a", end_hh = "0", end_mi = "0",
end_ap = "a")
html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes("th.ddtitle") %>% html_text()
classinfo <- html %>% html_nodes(xpath = ".//tr/td[@class='dddefault']") #find the cells holding each class's details
classinfo <- classinfo[nchar( html_text(classinfo))>50 ] #eliminate the extra found nodes
classlink <- classinfo %>% html_nodes("a") %>% html_attr("href") #find all links
classlinktext <- classinfo %>% html_nodes("a") %>% html_text() #find the link text
classlink <- classlink[classlinktext=="View Catalog Entry"] #keep only the links for "View Catalog Entry"
dfs <- lapply(seq_along(classinfo), function(i) {
# classname <-classes[i] %>% html_node(xpath = ".//a") %>% html_text()
instructor_node <- classinfo[i] %>% html_nodes("table.datadisplaytable") %>%
html_nodes(xpath = ".//a[contains(@href, 'mailto')]")
instructors <- html_attr(instructor_node, "target")
emails <- html_attr(instructor_node, "href")
#check whether an instructor was assigned; if not, use TBD
if(length(instructors)==0){
instructors <- "TBD"
emails <- "NA"
}
data.frame(classname=classes[i], link=classlink[i], instructors, emails)
})
#merge the list into a single data frame
answer <- bind_rows(dfs)
#consolidate the instructors in the same class (the grouping column is classname, as named in data.frame() above)
finalanswer <- answer %>% group_by(classname) %>% summarize(instructors2 = paste(instructors, collapse = ", "), emails = paste(emails, collapse = ", "))
# the paste(instructors, collapse = ", ") could be done inside the lapply
# loop, but doing it here adds some flexibility, depending on whether
# answer or finalanswer is the end result.
head(finalanswer, 16)
tail(finalanswer, 16)
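The question notes that sel_subj changes for each subject. One possible way to reuse the same query for another subject is sketched below. The helper name `set_subject` and the subject code "MATH" are assumptions made for illustration; the sketch relies only on the fact that the query list contains two `sel_subj` entries, a "dummy" one and the real one.

```r
# Hypothetical helper: replace the non-dummy sel_subj entry of a query list.
set_subject <- function(query, subject) {
  values <- vapply(query, as.character, character(1))
  idx <- which(names(query) == "sel_subj" & values != "dummy")
  stopifnot(length(idx) == 1)   # expect exactly one real subject entry
  query[[idx]] <- subject
  query
}

# Shortened stand-in for the full query list used above.
base_query <- list(term_in = "202110", sel_subj = "dummy",
                   sel_subj = "ARCH", sel_crse = "")
math_query <- set_subject(base_query, "MATH")
```

Each modified list could then be passed as the `body` of `httr::POST()` in the same way as the original query, for example inside an `lapply()` over a vector of subject codes.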