Extract links from an HTML table

Views: 139 | Published: 2018/6/14 19:38:36 | Tags: html xml r web-scraping

Problem description

I'm trying to extract the links of type "Specimen" from the following webpage: http://ipt.humboldt.org.co/. I can get the table from the webpage using the following code:

    library(XML)
    sitePage <- htmlParse("http://ipt.humboldt.org.co/")
    tableNodes <- getNodeSet(sitePage, "//table")
    siteTable <- readHTMLTable(tableNodes[[1]])

However, the links are missing after I call readHTMLTable.

Solution

It ended up being an intricate XPath expression:

    library(XML)
    sitePage <- htmlParse("http://ipt.humboldt.org.co/")
    hyperlinksYouNeed <- getNodeSet(sitePage, "//table[@id='resourcestable']//td[5][.='Specimen']/preceding-sibling::td[3]/a/@href")

Let me explain the XPath expression bit by bit:

//table[@id='resourcestable'] -> This selects the main table on the page, the one whose id is 'resourcestable'.

//td[5][.='Specimen'] -> This keeps only the fifth cell of each row when its text is exactly Specimen, so we are left with the rows whose Type is Specimen.

/preceding-sibling::td[3] -> From that cell we look backwards along the row, three cells to be precise. Be careful: preceding-sibling counts backwards from the current cell and does not include it, so td[1] is the cell immediately before the Type cell and td[3] is the Name cell we want.

/a -> Now get the a node contained in that cell.

/@href -> And finally, more precisely, the content of its href attribute.
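
To see the expression work without depending on the live site, here is a minimal sketch that runs the same XPath against a small inline table. The HTML, the href values and the column order (Name in the second cell, Type in the fifth) are assumptions made up for illustration; the real page may be laid out differently.

```r
library(XML)

# Hypothetical stand-in for the real page: same table id, and an assumed
# column order in which Name is the 2nd cell and Type is the 5th cell.
html <- "
<html><body>
<table id='resourcestable'>
  <tr><th>Logo</th><th>Name</th><th>Organisation</th><th>Subtype</th><th>Type</th></tr>
  <tr><td>-</td><td><a href='resource?r=birds'>Birds</a></td><td>Org A</td><td>x</td><td>Specimen</td></tr>
  <tr><td>-</td><td><a href='resource?r=plots'>Plots</a></td><td>Org B</td><td>y</td><td>Occurrence</td></tr>
  <tr><td>-</td><td><a href='resource?r=fish'>Fish</a></td><td>Org C</td><td>z</td><td>Specimen</td></tr>
</table>
</body></html>"

# asText = TRUE tells htmlParse that the argument is markup, not a file or URL.
doc <- htmlParse(html, asText = TRUE)

xp <- "//table[@id='resourcestable']//td[5][.='Specimen']/preceding-sibling::td[3]/a/@href"

# getNodeSet returns one entry per matched attribute; flatten to a plain
# character vector and drop the "href" names.
links <- unname(unlist(getNodeSet(doc, xp)))
print(links)
```

With this sample table, only the two rows whose fifth cell reads Specimen should contribute a link. If the column positions on the real page differ, the td[5] and td[3] indices would need adjusting accordingly.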
