Extract links from html table

Published: 2021/12/17 14:03:46. Tags: html, xml, r, web-scraping

Problem description

I'm trying to extract the links of type "Specimen" from the table on the following webpage: http://ipt.humboldt.org.co/. I can get the table from the webpage using the following code:

library(XML)
sitePage<-htmlParse("http://ipt.humboldt.org.co/")
tableNodes<-getNodeSet(sitePage,"//table")
siteTable<-readHTMLTable(tableNodes[[1]])

However, the links are missing after I call readHTMLTable.

Solution

It ended up being an intricate XPath expression:

library(XML)
sitePage<-htmlParse("http://ipt.humboldt.org.co/")
hyperlinksYouNeed<-getNodeSet(sitePage,"//table[@id='resourcestable']
                                        //td[5][.='Specimen']
                                        /preceding-sibling
                                        ::td[3]
                                        /a
                                        /@href")

Let me explain the XPath expression bit by bit:

//table[@id='resourcestable'] -> This selects the main table on the page, the one whose id is 'resourcestable'.
//td[5][.='Specimen'] -> Within that table, take the fifth cell of each row and keep only the cells whose text is 'Specimen', i.e. the rows whose Type is Specimen.
/preceding-sibling -> From that cell, start looking backwards along the row.
::td[3] -> Go exactly 3 cells back from where we are. Be careful: preceding-sibling counts backwards from the current cell, so td[1] is the cell immediately before the Type cell, and td[3] is the Name column we want.
/a -> Now get the a node contained in that cell.
/@href -> Finally, take the content of its href attribute.
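
To see the backwards counting in isolation, here is a minimal sketch that runs the same expression against a small inline table instead of the live site. The column layout, cell contents and example.org URLs are assumptions made up for illustration and do not reproduce the real page. It also uses xpathSApply with xmlGetAttr rather than a trailing /@href, which is a swapped-in variant that returns the links as a plain character vector.

```r
library(XML)

# Hypothetical stand-in for the resources table: five cells per row,
# with the link in the 2nd cell and the type in the 5th cell.
html <- '
<table id="resourcestable">
  <tr><td>logo</td><td><a href="http://example.org/a">A</a></td><td>Org 1</td><td>x</td><td>Specimen</td></tr>
  <tr><td>logo</td><td><a href="http://example.org/b">B</a></td><td>Org 2</td><td>x</td><td>Checklist</td></tr>
  <tr><td>logo</td><td><a href="http://example.org/c">C</a></td><td>Org 3</td><td>x</td><td>Specimen</td></tr>
</table>'

# Parse the string itself rather than treating it as a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Start at the 5th cell when its text is "Specimen", step 3 cells back
# (5th -> 4th -> 3rd -> 2nd), and read the href of the a node found there.
hrefs <- xpathSApply(
  doc,
  "//table[@id='resourcestable']//td[5][.='Specimen']/preceding-sibling::td[3]/a",
  xmlGetAttr, "href"
)
hrefs <- as.character(unlist(hrefs))

print(hrefs)
```

Only the two rows whose fifth cell reads Specimen contribute a link; the Checklist row is filtered out by the predicate before the backwards step is taken.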
