Scraping HTML from vector of strings in R
Question
Building on an answer to a former question of mine, I'm scraping this website for links with the RSelenium package, using the following code:
library(RSelenium)

startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444, browserName = "chrome")
remDr$open(silent = TRUE)
remDr$navigate("http://karakterstatistik.stads.ku.dk/")
Sys.sleep(2)
webElem <- remDr$findElement("name", "submit")
webElem$clickElement()
Sys.sleep(5)

html_source <- vector("list", 100)
i <- 1
while (i <= 100) {
  html_source[[i]] <- remDr$getPageSource()
  webElem <- remDr$findElement("id", "next")
  webElem$clickElement()
  Sys.sleep(2)
  i <- i + 1
}
Sys.sleep(3)
remDr$close()
When I try to scrape the vector of strings created above (html_source) using the rvest package, I get an error because the source is not an HTML document:
kar.links = html_source %>%
  read_html(encoding = "UTF-8") %>%
  html_nodes("#searchResults a") %>%
  html_attr("href")
I've tried collapsing the vector and looked for a string-to-HTML converter, but nothing seems to work. I feel the solution lies in how I save the page sources in the loop.
EDIT: fixed it with this less-than-beautiful solution:
links <- vector("list", 100)
i <- 1
while (i <= 100) {
  links[[i]] <- html_source[[i]][[1]] %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("#searchResults a") %>%
    html_attr("href")
  i <- i + 1
}
col_links <- links %>% unlist()
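The same post-processing can also be written without the manual counter. A minimal sketch, assuming rvest is installed and html_source was filled by the scraping loop above; the [[1]] unwrapping is the key step, since each list element wraps the raw HTML string:

```r
library(rvest)

# For each saved page source, unwrap the raw HTML string with [[1]],
# parse it, and pull the href attributes from the search-result links.
links <- lapply(html_source, function(page) {
  page[[1]] %>%
    read_html(encoding = "UTF-8") %>%
    html_nodes("#searchResults a") %>%
    html_attr("href")
})
col_links <- unlist(links)
```

lapply replaces the while loop and counter, and unlist collapses the per-page results into one character vector, as in the original fix.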
Solution
html_source is a nested list:

str(head(html_source, 3))
# List of 3
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n  <title>Karakterfordeling</title>\n  <link rel=\"icon\"| __truncated__
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n  <title>Karakterfordeling</title>\n  <link rel=\"icon\"| __truncated__
#  $ :List of 1
#   ..$ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n  <title>Karakterfordeling</title>\n  <link rel=\"icon\"| __truncated__
In your case, html_source is made up of 100 elements; each element is itself a list with one element, which is a string (the raw HTML code). Therefore, to get each raw HTML page, you need to access html_source[[1]][[1]], html_source[[2]][[1]], and so on. To flatten html_source, you can do: lapply(html_source, `[[`, 1).
We get the same result if we use remDr$getPageSource()[[1]] in the while loop:

str(head(html_source, 3))
# List of 3
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n  <title>Karakterfordeling</title>\n  <link rel=\"icon\"| __truncated__
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n  <title>Karakterfordeling</title>\n  <link rel=\"icon\"| __truncated__
#  $ : chr "<!DOCTYPE html><html xmlns=\"http://www.w3.org/1999/xhtml\"><head>\n  <title>Karakterfordeling</title>\n  <link rel=\"icon\"| __truncated__
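To see the difference concretely, here is a small self-contained sketch; the two HTML strings are made up for illustration (reusing the question's #searchResults selector), standing in for what remDr$getPageSource() returns:

```r
library(rvest)

# Toy stand-in for html_source: each page source is wrapped in a
# one-element list, mimicking the shape remDr$getPageSource() produces.
html_source <- list(
  list("<html><body><div id='searchResults'><a href='page1.html'>1</a></div></body></html>"),
  list("<html><body><div id='searchResults'><a href='page2.html'>2</a></div></body></html>")
)

# Flatten the nested list so each element is a plain character string.
flat <- lapply(html_source, `[[`, 1)

# Each string can now be parsed individually with rvest.
links <- unlist(lapply(flat, function(src) {
  read_html(src) %>%
    html_nodes("#searchResults a") %>%
    html_attr("href")
}))
print(links)
# [1] "page1.html" "page2.html"
```

Passing the flattened strings to read_html one at a time is what makes the rvest pipeline work; handing it the nested list directly is what triggered the original error.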