Scraping an HTML table with images using the XML R package

Question

I want to scrape HTML tables using the XML package for R, in a way similar to that discussed in this thread:

Scraping html tables into R data frames using the XML package

The main difference in my case is that I also want the text relating to an image in the HTML table. For example, the table at http://www.theplantlist.org/tpl/record/kew-422570 contains a "Confidence" column with an image showing one to three stars. If I use:


readHTMLTable("http://www.theplantlist.org/tpl/record/kew-422570")

then the output column for "Confidence" is blank apart from the header. Is there any way to get some form of data in this column, for example the HTML code linking to the appropriate image?

Any suggestions of how to go about this would be much appreciated!

Recommended answer

I was able to do this using SelectorGadget to find an XPath for the images:

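library(XML)
library(RCurl)
# Download the page and parse it into an HTML document
d = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-422570"))
# XPath from SelectorGadget: every <img> inside an element with class "synonyms"
path = '//*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img'

# Collect the attributes of each image and keep only the "src" row
xpathSApply(d, path, xmlAttrs)["src",]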

[1] "/img/H.png" "/img/L.png" "/img/H.png" "/img/H.png" "/img/H.png"
[6] "/img/H.png" "/img/H.png"
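
The vector of image paths above can be turned back into a "Confidence" column. The following is only a rough sketch of one way to do that, not part of the original answer. It assumes that the file names H.png, M.png and L.png stand for high, medium and low confidence, which the source does not confirm. To stay runnable offline it uses a small made-up HTML snippet instead of the live page, and the names html, rows, img_src, labels and result are hypothetical. It works row by row so that a row without an image produces NA rather than shifting the remaining values.

```r
library(XML)

# Hypothetical stand-in for the real page, so the sketch runs offline.
html <- '<html><body><table class="synonyms">
  <tr><th>Name</th><th>Confidence</th></tr>
  <tr><td>Name one</td><td><img src="/img/H.png"/></td></tr>
  <tr><td>Name two</td><td><img src="/img/L.png"/></td></tr>
  <tr><td>Name three</td><td></td></tr>
</table></body></html>'
doc <- htmlParse(html, asText = TRUE)

# Data rows only: rows that contain at least one <td> (this skips the header row).
rows <- getNodeSet(doc, '//table[contains(@class, "synonyms")]//tr[td]')

# For each row, take the src of the first image, or NA if the row has none.
img_src <- sapply(rows, function(row) {
  src <- xpathSApply(row, "./td//img", xmlGetAttr, "src")
  if (length(src) == 0) NA_character_ else src[[1]]
})

# Assumed meaning of the file names (an assumption, not confirmed by the source).
labels <- c(H = "High", M = "Medium", L = "Low")
code <- sub("\\.png$", "", basename(img_src))

# Combine the text of the first cell with the decoded confidence label.
result <- data.frame(
  name = sapply(rows, function(row) xmlValue(getNodeSet(row, "./td[1]")[[1]])),
  confidence = unname(labels[code]),
  stringsAsFactors = FALSE
)
print(result)
```

On the real page the same row-wise loop could be run on the document d parsed in the answer, provided the table there also carries the "synonyms" class that the answer's XPath relies on.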
