Problem with web scraping of required content from a URL link in R


Problem description

I am using a script to scrape the required content from a link that covers several different subjects.

library(rvest)
url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"

query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
              sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
              sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
              sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
              sel_crse = "",      sel_title = "",     sel_insm = "%",
              sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
              sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
              sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
              begin_ap = "a",     end_hh = "0",       end_mi = "0",
              end_ap = "a")

In the above query, sel_subj changes for each subject.
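
Since only sel_subj varies, one way to reuse the query is to wrap it in a small helper. This is a sketch rather than part of the original script: `build_query()` is a hypothetical name, the `term` default is copied from the query above, and "ART" is a placeholder subject code that has not been checked against the site.

```r
# Hypothetical helper: build the POST body for one subject code.
# The field list is copied from the query above; only the second
# sel_subj entry (the real subject) changes between calls.
build_query <- function(subject, term = "202110") {
  list(term_in = term, sel_subj = "dummy", sel_day = "dummy",
       sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
       sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
       sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = subject,
       sel_crse = "",      sel_title = "",     sel_insm = "%",
       sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
       sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
       sel_attr = "%",     begin_hh = "0",     begin_mi = "0",
       begin_ap = "a",     end_hh = "0",       end_mi = "0",
       end_ap = "a")
}

# One query per subject; "ART" is a placeholder, not a verified code.
queries <- lapply(c("ARCH", "ART"), build_query)
```

Each element of `queries` can then be passed as the `body` of `httr::POST()` in the same way as `query`.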

html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes(xpath = "//th/a") %>% html_text()
instructor_nodes <- html %>% 
  html_nodes(xpath = "//td[@class='dddefault']/a[contains(@href, 'mailto')]")

instructors <- html_attr(instructor_nodes, "target") 
emails <- html_attr(instructor_nodes, "href")

length(classes)
[1] 32
length(instructors)
[1] 39
length(emails)
[1] 39

sq <- seq(max(length(classes), length(instructors), length(emails)))
data.frame(classes[sq], instructors[sq], emails[sq])

The result looks like the following, which is wrong:

                                                classes.sq.      instructors.sq.                  emails.sq.
1   Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2   Fundamentals of Design Studio - 23839 - ARCH 1111 - 002     Pamela J. Hurley mailto:pjhurley@memphis.edu
3            Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore mailto:mkchsmre@memphis.edu
4            Design Visualization - 18386 - ARCH 1113 - 002 Michael K. Chisamore mailto:mkchsmre@memphis.edu
5       History of Architecture 1 - 23218 - ARCH 1211 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
6           Building Technology 2 - 23840 - ARCH 2412 - 001     Marika E. Snider mailto:mesnider@memphis.edu
7       Computer Apps in Design 2 - 11111 - ARCH 2612 - 001   Timothy E. Michael mailto:tmichael@memphis.edu
8                 Design Studio 2 - 11112 - ARCH 2712 - 001   Timothy E. Michael mailto:tmichael@memphis.edu
9                 Design Studio 2 - 15408 - ARCH 2712 - 002      Andrew M. Parks  mailto:amparks@memphis.edu
10  Survey of Interiors+Furniture - 25734 - ARCH 3213 - 001      Andrew M. Parks  mailto:amparks@memphis.edu
11  Determinants of Modern Design - 27436 - ARCH 3221 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
12            Structural Design 2 - 23837 - ARCH 3322 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
13          Professional Practice - 25097 - ARCH 3431 - 001      Andrew M. Parks  mailto:amparks@memphis.edu
14                Design Studio 4 - 11115 - ARCH 3714 - 001         Sonia Raheel  mailto:sraheel@memphis.edu
15                Design Studio 4 - 23221 - ARCH 3714 - 002     Pamela J. Hurley mailto:pjhurley@memphis.edu
16 Architecture Independent Study - 11117 - ARCH 4021 - 201   Jennifer L. Barker mailto:jlbrker1@memphis.edu
17             Sustainable Design - 19491 - ARCH 4421 - 001   Jennifer L. Barker mailto:jlbrker1@memphis.edu
18     Internship in Architecture - 21000 - ARCH 4430 - 001     Marika E. Snider mailto:mesnider@memphis.edu
19                Design Studio 6 - 11134 - ARCH 4716 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
20             Sustainable Design - 19492 - ARCH 6421 - 001     Marika E. Snider mailto:mesnider@memphis.edu
21      Advanced Design Seminar 2 - 18387 - ARCH 7012 - 001     Marika E. Snider mailto:mesnider@memphis.edu
22    Contemporary Architecture 2 - 24104 - ARCH 7222 - 001     Pamela J. Hurley mailto:pjhurley@memphis.edu
23     Internship in Architecture - 19495 - ARCH 7430 - 001   Jennifer L. Barker mailto:jlbrker1@memphis.edu
24      Adv Professional Practice - 19496 - ARCH 7431 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
25       Advanced Design Studio 2 - 18389 - ARCH 7712 - 001     Michael D. Hagge  mailto:mdhagge@memphis.edu
26          Architecture Research - 25098 - ARCH 7930 - 001     Brian D. Andrews mailto:bdndrews@memphis.edu
27     Architecture Thesis Studio - 19499 - ARCH 7996 - 003 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
28     Architecture Thesis Studio - 19500 - ARCH 7996 - 004     Brian D. Andrews mailto:bdndrews@memphis.edu
29     Architecture Thesis Studio - 19501 - ARCH 7996 - 005      Andrew M. Parks  mailto:amparks@memphis.edu
30     Architecture Thesis Studio - 19502 - ARCH 7996 - 006     Michael D. Hagge  mailto:mdhagge@memphis.edu
31     Architecture Thesis Studio - 19503 - ARCH 7996 - 007     Brian D. Andrews mailto:bdndrews@memphis.edu
32     Architecture Thesis Studio - 20972 - ARCH 7996 - 008 Michael K. Chisamore mailto:mkchsmre@memphis.edu
33                                                     <NA>     Pamela J. Hurley mailto:pjhurley@memphis.edu
34                                                     <NA>   Jennifer L. Barker mailto:jlbrker1@memphis.edu
35                                                     <NA> Michael K. Chisamore mailto:mkchsmre@memphis.edu
36                                                     <NA>     Pamela J. Hurley mailto:pjhurley@memphis.edu
37                                                     <NA> Jennifer L. Thompson mailto:jlthmps5@memphis.edu
38                                                     <NA>     Brian D. Andrews mailto:bdndrews@memphis.edu
39                                                     <NA>     Marika E. Snider mailto:mesnider@memphis.edu

But on the page itself, the data looks different.
For example:
A few classes have no instructor or email at all (the page shows TBA).

A few other classes have two, three, four, or more instructors.

And a few other classes list the same instructor multiple times.

For such data, I want my output to look like this:

                                                classes.sq.      instructors.sq.                  emails.sq.
1   Fundamentals of Design Studio - 23838 - ARCH 1111 - 001 Jennifer L. Thompson mailto:jlthmps5@memphis.edu
2   Fundamentals of Design Studio - 23839 - ARCH 1111 - 002          TBA         
3            Design Visualization - 11107 - ARCH 1113 - 001 Michael K. Chisamore,Pamela J. Hurley mailto:mkchsmre@memphis.edu,pjhurley@memphis.edu
4            Design Visualization - 18386 - ARCH 1113 - 002 Pamela J. Hurley,Michael K. Chisamore mailto:pjhurley@memphis.edu,mkchsmre@memphis.edu
5       History of Architecture 1 - 23218 - ARCH 1211 - 001     Marika E. Snider mailto:mesnider@memphis.edu
6           Building Technology 2 - 23840 - ARCH 2412 - 001     Timothy E. Michael mailto:tmichael@memphis.edu

P.S. If the posted URL doesn't work, follow these steps:

In this link `https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_disp_dyn_sched` 
Select by term -> Spring Term 2021 (view only) -> Submit
Subject -> select ARCH Architecture -> scroll down and click Class Search

How can I deal with missing data (TBA), multiple instructors, and the same instructor listed multiple times?

Recommended answer

The problem is the way html_nodes() is used. Called on the whole document, it returns a flat list of every matching node, with no record of which class each match belongs to. Since the page sometimes has multiple instructors per class, and sometimes none, a more targeted approach is needed.
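
The difference can be seen on a toy page. The HTML below is an assumption made up for illustration, not the real Banner markup; it only mimics the situation where one class has two instructors, one has none, and one has a single instructor.

```r
library(rvest)

# Toy page (made-up markup): class A has two instructors,
# class B has none (TBA), class C has one.
page <- read_html('
  <div class="cls"><b>A</b><a href="mailto:x@u.edu" target="X">e</a>
                           <a href="mailto:y@u.edu" target="Y">e</a></div>
  <div class="cls"><b>B</b>TBA</div>
  <div class="cls"><b>C</b><a href="mailto:z@u.edu" target="Z">e</a></div>')

# Flat query: the three links come back with no record of their class.
flat <- page %>%
  html_nodes(xpath = "//a[contains(@href, 'mailto')]") %>%
  html_attr("target")

# Per-node query: search inside each class node separately,
# so every result stays attached to its class.
cls <- page %>% html_nodes("div.cls")
per_class <- lapply(seq_along(cls), function(i) {
  cls[i] %>%
    html_nodes(xpath = ".//a[contains(@href, 'mailto')]") %>%
    html_attr("target")
})
```

Here `flat` has three entries for three classes and cannot be lined up with them, while `per_class` has exactly one element per class, with an empty vector for the class that has no instructor.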

In this code block, we first find each of the class nodes, which contain all of the information we want. Then we parse each node individually (inside the lapply function) to extract the instructors and emails, checking for empty fields as we go. Each data frame has one row per instructor, so a data frame has multiple rows when a class has multiple instructors.

We combine the list of per-class data frames into one (bind_rows) and then merge the instructor and email results for the same class.

library(rvest)
library(dplyr)

url   <- "https://ssb.bannerprod.memphis.edu/prod/bwckschd.p_get_crse_unsec"

query <- list(term_in = "202110", sel_subj = "dummy", sel_day = "dummy",
              sel_schd = "dummy", sel_insm = "dummy", sel_camp = "dummy",
              sel_levl = "dummy", sel_sess = "dummy", sel_instr = "dummy",
              sel_ptrm = "dummy", sel_attr = "dummy", sel_subj = "ARCH",
              sel_crse = "",      sel_title = "",     sel_insm = "%",
              sel_from_cred = "", sel_to_cred = "",   sel_camp = "%",
              sel_levl = "%",     sel_ptrm = "%",     sel_instr = "%",
              sel_attr = "%",     begin_hh =  "0",    begin_mi = "0",
              begin_ap = "a",     end_hh = "0",       end_mi = "0",
              end_ap = "a")

html <- read_html(httr::POST(url, body = query))
classes <- html %>% html_nodes("th.ddtitle") %>% html_text()

# find the dddefault cells, then keep only the long ones that hold a class's details
classinfo <- html %>% html_nodes(xpath = ".//tr/td[@class='dddefault']")
classinfo <- classinfo[nchar(html_text(classinfo)) > 50]   # eliminate the extra found nodes

classlink <- classinfo %>% html_nodes("a") %>% html_attr("href")  #find all links
classlinktext <- classinfo %>% html_nodes("a") %>% html_text()    #find the link text
classlink <- classlink[classlinktext=="View Catalog Entry"]       #keep only the links for "View Catalog Entry"

dfs <- lapply(seq_along(classinfo), function(i) {
  # search for instructor links inside this class's node only
  instructor_node <- classinfo[i] %>% html_nodes("table.datadisplaytable") %>%
    html_nodes(xpath = ".//a[contains(@href, 'mailto')]")

  instructors <- html_attr(instructor_node, "target")   # the name is stored in the target attribute
  emails <- html_attr(instructor_node, "href")
  # if no instructor has been assigned, record TBA
  if (length(instructors) == 0) {
    instructors <- "TBA"
    emails <- "NA"
  }
  data.frame(classname = classes[i], link = classlink[i], instructors, emails)
})
   
# merge the list into one data frame
answer <- bind_rows(dfs)

# consolidate the instructors of the same class into one row
# (group by classname, the column created in the data frames above)
finalanswer <- answer %>%
  group_by(classname) %>%
  summarize(instructors2 = paste(instructors, collapse = ", "),
            emails = paste(emails, collapse = ", "))
# The paste(instructors, collapse = ", ") could be done inside the lapply
# loop, but doing it here adds some flexibility depending on whether
# answer or finalanswer is the end result.
head(finalanswer, 16)
tail(finalanswer, 16)
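
The code above keeps every instructor row, so a class that lists the same instructor several times repeats the name in the pasted string. One possible way to collapse the repeats is to call dplyr's `distinct()` before summarizing. The sketch below runs on a made-up stand-in for `answer` (`answer_demo` and its values are hypothetical) and assumes missing emails are stored as real `NA` values rather than the string "NA".

```r
library(dplyr)

# Toy stand-in for `answer`; the values are made up for illustration.
answer_demo <- data.frame(
  classname   = c("A", "A", "A", "B"),
  instructors = c("X", "X", "Y", "TBA"),
  emails      = c("mailto:x@u.edu", "mailto:x@u.edu", "mailto:y@u.edu", NA),
  stringsAsFactors = FALSE
)

deduped <- answer_demo %>%
  distinct(classname, instructors, emails) %>%   # drop repeated instructor rows
  group_by(classname) %>%
  summarize(instructors = paste(instructors, collapse = ", "),
            emails = paste(na.omit(emails), collapse = ", "))
```

With this variant, class A ends up with "X, Y" instead of "X, X, Y", and the class without an instructor gets an empty email string.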
