Scraping HTML from vector of strings in R


Problem description


Building on an answer to a former question of mine, I'm scraping this website for links with the RSelenium package, using the following code:

library(RSelenium)

startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444,
                      browserName = "chrome")

remDr$open(silent = TRUE)
remDr$navigate("http://karakterstatistik.stads.ku.dk/")
Sys.sleep(2)

# Submit the search form
webElem <- remDr$findElement("name", "submit")
webElem$clickElement()
Sys.sleep(5)

# Save the page source of each of the 100 result pages
html_source <- vector("list", 100)
i <- 1
while (i <= 100) {
  html_source[[i]] <- remDr$getPageSource()
  webElem <- remDr$findElement("id", "next")
  webElem$clickElement()
  Sys.sleep(2)
  i <- i + 1
}
Sys.sleep(3)
remDr$close()

When I try to scrape the vector of strings created above (html_source) with the rvest package, I get an error because the source is not an HTML file:

library(rvest)

kar.links = html_source %>% 
  read_html(encoding = "UTF-8") %>% 
  html_nodes("#searchResults a") %>% 
  html_attr("href")

I've tried collapsing the vector and looking for a string-to-HTML converter, but nothing seems to work. I feel the solution lies somewhere in how I save the page sources in the loop.

EDIT: fixed it with this less-than-beautiful solution:

links <- vector("list", 100)
i <- 1
while (i <= 100) {
  links[[i]] <- html_source[[i]][[1]] %>% 
    read_html(encoding = "UTF-8") %>% 
    html_nodes("#searchResults a") %>% 
    html_attr("href") 
  i <- i + 1
}
col_links <- links %>% 
  unlist()

Solution

html_source is a nested list:

str(head(html_source, 3))
# List of 3
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__

In your case, html_source is made up of 100 elements; each element is itself a list with a single element, which is a character string containing the raw HTML of one page. Therefore, to get each raw HTML page, you need to access html_source[[1]][[1]], html_source[[2]][[1]], and so on.

To flatten html_source, you can do: lapply(html_source, `[[`, 1). We get the same result if we use remDr$getPageSource()[[1]] in the while loop:

str(head(html_source, 3))
# List of 3
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n    <title>Karakterfordeling</title>\n    <link rel=\"icon\"| __truncated__
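
Putting the two pieces together, here is a minimal sketch of the full link extraction over all saved pages (the names flattened_pages and all_links are illustrative, not from the original code):

library(rvest)

# Flatten the nested list so each element is the raw HTML string itself
flattened_pages <- lapply(html_source, `[[`, 1)

# Parse each page and pull out the search-result link targets
all_links <- flattened_pages %>%
  lapply(function(page) {
    page %>%
      read_html(encoding = "UTF-8") %>%
      html_nodes("#searchResults a") %>%
      html_attr("href")
  }) %>%
  unlist()

This avoids the extra [[1]] indexing in the workaround from the question's edit.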
