Extract links from an HTML table


Problem description

I'm trying to extract the links of type "Specimen" from the webpage http://ipt.humboldt.org.co/. I can get the table from the webpage using the following code:

library(XML)
sitePage<-htmlParse("http://ipt.humboldt.org.co/")
tableNodes<-getNodeSet(sitePage,"//table")
siteTable<-readHTMLTable(tableNodes[[1]])

However, the links are missing after I call readHTMLTable.
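
The behaviour described here can be reproduced without the live site. The sketch below uses a made-up inline table (its contents and the `/resource/birds` URL are assumptions for illustration, not taken from the real page). By default readHTMLTable keeps only the text of each cell, so the href values have to be read separately, for example with an XPath query.

```r
library(XML)

# Hypothetical one-row table standing in for the real page
html <- "<table>
  <tr><th>Name</th><th>Type</th></tr>
  <tr><td><a href='/resource/birds'>Birds</a></td><td>Specimen</td></tr>
</table>"

doc <- htmlParse(html, asText = TRUE)

# readHTMLTable converts each cell to its text content, so the href is gone
tab <- readHTMLTable(getNodeSet(doc, "//table")[[1]], stringsAsFactors = FALSE)
print(tab)

# The attribute is still in the parsed document and can be queried directly
hrefs <- as.character(xpathSApply(doc, "//table//a/@href"))
print(hrefs)
```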

Solution

It ended up being an intricate XPath expression:

library(XML)
sitePage<-htmlParse("http://ipt.humboldt.org.co/")
hyperlinksYouNeed<-getNodeSet(sitePage,"//table[@id='resourcestable']
                                        //td[5][.='Specimen']
                                        /preceding-sibling
                                        ::td[3]
                                        /a
                                        /@href")

Let me explain the XPath expression bit by bit:

  • //table[@id='resourcestable'] -> Selects the main table on the page, whose id is 'resourcestable'.

  • //td[5][.='Specimen'] -> Within that table, keeps the fifth cell of a row only when its text is exactly 'Specimen', which filters the rows down to those that have Type as Specimen.

  • /preceding-sibling -> From that cell, we start looking backwards along the row.

  • ::td[3] -> Three steps backwards, to be precise. Be careful: preceding-sibling counts backwards from the current cell, so td[1] is the Type column, td[2] is the Organisation column and td[3] is the Name column we want.

  • /a -> Selects the a element contained in that cell.

  • /@href -> Finally, takes the content of its href attribute.
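
The steps above can be tried on a small inline table instead of the live site. This is only a sketch: the rows, the URLs and the column layout (Logo, Name, Organisation, Type, then the cell that holds 'Specimen') are assumptions chosen to match the explanation, not a copy of the real page.

```r
library(XML)

# Hypothetical stand-in for the 'resourcestable' table (made-up rows)
html <- "<table id='resourcestable'>
  <tr><td>-</td><td><a href='resource?r=birds'>Birds</a></td><td>Org A</td><td>Occurrence</td><td>Specimen</td></tr>
  <tr><td>-</td><td><a href='resource?r=list'>Checklist</a></td><td>Org B</td><td>Checklist</td><td>Inventory</td></tr>
  <tr><td>-</td><td><a href='resource?r=plants'>Plants</a></td><td>Org C</td><td>Occurrence</td><td>Specimen</td></tr>
</table>"

doc <- htmlParse(html, asText = TRUE)

# Take the first fifth-cell whose text is exactly 'Specimen'
cell <- getNodeSet(doc, "//table[@id='resourcestable']//td[5][.='Specimen']")[[1]]

# preceding-sibling counts backwards from that cell: [1] is the nearest one
back <- as.character(sapply(1:3, function(i) {
  xpathSApply(cell, sprintf("preceding-sibling::td[%d]", i), xmlValue)
}))
print(back)

# Full expression, written on one line: href of the link three cells back
xp <- "//table[@id='resourcestable']//td[5][.='Specimen']/preceding-sibling::td[3]/a/@href"
links <- as.character(xpathSApply(doc, xp))
print(links)
```

Printing `back` lists the cells in the order the axis visits them, nearest first, which is why the index 3 lands on the Name cell in this layout.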
