Using xpathSApply to scrape XML attributes in R


Problem description

I am scraping XML in R using xpathSApply (in the XML package) and having trouble pulling attributes out.

First, a relevant snippet of XML:

<div class="offer-name">
  <a href="http://www.somesite.com" itemprop="name">Fancy Product</a>
</div>

I have successfully pulled the 'Fancy Product' (i.e. element?) using:

Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue) 

That took some time (I'm a n00b), but the documentation is good and there are several answered questions here that I was able to leverage. I can't figure out how to pull out "http://www.somesite.com" though (the attribute?). I've speculated that it involves changing the third argument from 'xmlValue' to 'xmlGetAttr', but I could be totally off.

FYI: (1) there are 2 more parent <div> elements above the snippet I pasted, and (2) here is the abbreviated, complete-ish code (which I don't think is relevant, but I've included it for the sake of completeness):

library(XML)
library(httr)

content2 = paste(readLines(file.choose()), collapse = "\n") # User will select file.
parsedHTML = htmlParse(content2,asText=TRUE)

Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue) 

Solution

The href is an attribute of the <a> element, not of the <div>. Select the appropriate node with //div/a and pass xmlGetAttr to xpathSApply, with name = "href" as the extra argument:

library(XML)

# Sample document from the question
xData <- '<div class="offer-name">
  <a href="http://www.somesite.com" itemprop="name">Fancy Product</a>
</div>'
parsedHTML <- xmlParse(xData)

# Text content of each offer-name <div>
Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue)

# Value of the href attribute of each <a> directly under a <div>
hrefs <- xpathSApply(parsedHTML, "//div/a", xmlGetAttr, "href")
hrefs
# [1] "http://www.somesite.com"
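On a real page, //div/a may match links outside the offer blocks, and some links may have no href at all. The following is a minimal sketch of one way to handle both cases. The sample HTML, the second "Plain Product" entry, and the names html, doc, links, offers and direct are made up for illustration and are not from the question. It scopes the XPath to the offer-name divs and uses the default argument of xmlGetAttr so that a missing href comes back as NA rather than NULL:

```r
library(XML)

# Hypothetical sample page: two offers, the second one has no href.
html <- '<html><body>
  <div class="offer-name"><a href="http://www.somesite.com" itemprop="name">Fancy Product</a></div>
  <div class="offer-name"><a itemprop="name">Plain Product</a></div>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Restrict the match to links inside the offer-name divs.
links <- "//div[@class='offer-name']/a"

# Link text and href, one entry per matched <a>; a missing href becomes NA.
offers <- data.frame(
  name = xpathSApply(doc, links, xmlValue),
  href = xpathSApply(doc, links, xmlGetAttr, "href", default = NA_character_),
  stringsAsFactors = FALSE
)
print(offers)

# Alternative: select the attribute nodes directly in the XPath expression.
# This only returns hrefs that exist, so it can be shorter than `offers`.
direct <- as.character(xpathSApply(doc, paste0(links, "/@href")))
```

Because both columns of offers come from the same node set, each name stays lined up with its own link. The direct form is more compact but loses that alignment when some links lack the attribute.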
