Scraping and extracting XML sitemap elements using R and Rvest

Problem Description

I need to extract a large number of XML sitemap elements from multiple XML files using rvest. I have been able to extract html_nodes from webpages using XPaths, but working with XML files is new to me.

Also, I can't find a Stack Overflow question that shows how to parse an XML file from its address, rather than parsing a large text chunk of XML.

Example of what I have used for HTML:

library(dplyr)
library(rvest)

webpage <- "https://www.example.co.uk/"

data <- webpage %>%
  read_html() %>%
  html_nodes("any given node goes here") %>%
  html_text()

How do I adapt this to take the "loc" element from an XML file (parsing the file's address) that looks like this:

<urlset>
<url>
<loc>https://www.example.co.uk/</loc>
<lastmod>2020-05-01</lastmod>
<changefreq>always</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://www.example.co.uk/news</loc>
<changefreq>always</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>https://www.example.co.uk/news/uk</loc>
<changefreq>always</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>https://www.example.co.uk/news/weather</loc>
<changefreq>always</changefreq>
<priority>0.5</priority>
</url>
<url>
<loc>https://www.example.co.uk/news/world</loc>
<changefreq>always</changefreq>
<priority>0.5</priority>
</url>
</urlset>

Here is what I have changed in the script kindly provided by Dave:

library(xml2)

#list of files to process
fnames<-c("xml1.xml")

dfs<-lapply(fnames, function(fname) {
  doc<-read_xml(fname)

  #find loc and lastmod
  loc<-trimws(xml_text(xml_find_all(doc, ".//loc")))
  lastmod<-trimws(xml_text(xml_find_all(doc, ".//lastmod")))

  #find all of the nodes/records under the urlset node
  nodes<-xml_children(xml_find_all(doc, ".//urlset"))

  #find the sub node names and values
  nodenames<-xml_name(nodes)
  nodevalues<-trimws(xml_text(nodes))

  #make data frame of all the values
  df<-data.frame(file=fname, loc=loc, lastmod=lastmod, node.names=nodenames,
                 values=nodevalues, stringsAsFactors = FALSE)

})

#Make one long df
longdf<-do.call(rbind, dfs)

#make into a wide format
library(tidyr)
finalanswer<-spread(longdf, key=node.names, value=values)

Answer

Since the number of children per url node differs (some url entries lack lastmod, so the columns in your data.frame call have mismatched lengths), here is a working approach that builds one small data frame per url node:

library(xml2)

#"text" holds the sitemap XML shown above as a string;
#read_xml() also accepts a file path or URL
file <- read_xml(text)

library(dplyr)

#find parent nodes
parents <-xml_find_all(file, ".//url")

#parse each child
dfs<-lapply(parents, function(node){
  #Find all children
  nodes <- xml_children(node)

  #get node name and value
  nodenames<-  xml_name(nodes)
  values <- xml_text(nodes)

  #make data frame with the results
  df<- as.data.frame(t(values), stringsAsFactors=FALSE)
  names(df)<-nodenames
  df
})

#Combine into the final answer
answer<-bind_rows(dfs)
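
With the sample XML above, answer should look roughly like this (bind_rows fills the missing lastmod values with NA):

                                 loc    lastmod changefreq priority
1         https://www.example.co.uk/ 2020-05-01     always      0.8
2     https://www.example.co.uk/news       <NA>     always      0.6
3  https://www.example.co.uk/news/uk       <NA>     always      0.5
...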

Since you have multiple files, you could enclose the script in an outer loop to cycle through the file list. Of course, that is a loop within a loop, so performance will suffer if there is a large number of files and a large number of parent nodes in each file.
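
For reference, here is a sketch of that outer loop, reusing the per-node parsing from above; the vector of file names is hypothetical:

#hypothetical list of sitemap files
fnames <- c("xml1.xml", "xml2.xml")

all_dfs <- lapply(fnames, function(fname) {
  #find the parent url nodes in this file
  parents <- xml_find_all(read_xml(fname), ".//url")

  #parse each url node as before
  dfs <- lapply(parents, function(node) {
    nodes <- xml_children(node)
    df <- as.data.frame(t(xml_text(nodes)), stringsAsFactors = FALSE)
    names(df) <- xml_name(nodes)
    df
  })

  #tag each row with the file it came from
  bind_rows(dfs) %>% mutate(file = fname)
})

answer <- bind_rows(all_dfs)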

Alternative: if the number of child node types is small, it is best to parse them directly and avoid the lapply loop above.

loc <- xml_find_first(parents, ".//loc") %>% xml_text()
lastmod <- xml_find_first(parents, ".//lastmod") %>% xml_text()
changefreq <- xml_find_first(parents, ".//changefreq") %>% xml_text()
priority <- xml_find_first(parents, ".//priority") %>% xml_text()

answer<-data.frame(loc, lastmod, changefreq, priority)
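
One caveat worth noting: real sitemap files usually declare a namespace on the urlset element (xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"), which the sample above omits. With a namespace present, plain XPaths such as .//url match nothing. One workaround in xml2 is to strip the namespaces before searching, as in this sketch (the file name is hypothetical):

doc <- read_xml("sitemap.xml")
xml_ns_strip(doc)   #drops namespace declarations in place
parents <- xml_find_all(doc, ".//url")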
