Find cell in HTML table containing a specific icon
Question
I am looking for code that can inform me in which cell of an html table a particular icon resides. Here is what I am working with:
u <- "http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1"
doc <- rvest::html(u)
tab <- rvest::html_table(doc, fill = TRUE)[[6]]
The column "Pos." designates the player's position in the field. Some of these have an additional icon. I can see the presence of these icons on the page as follows:
rvest::html_nodes(doc, ".kapitaenicon-table")
But this doesn't tell me WHERE they are. I would like my code to return that the icon occurs in rows 2, 10, 11 and 27 of the "Pos." column in the table. How can I do that?
Recommended answer
A little bit more rvest and XPath magic can get you the indices:
library(rvest)
library(magrittr)
library(XML)
pg <- html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")
pg %>%
  html_nodes("table") %>%
  extract2(6) %>%
  html_nodes("tbody > tr") %>%
  sapply(function(x) {
    length(xpathSApply(x, "./td[8]/span[@class='kapitaenicon-table icons_sprite']")) == 1
  }) %>% which
## [1] 2 10 11 27
That gets the 6th table, extracts the tr elements, then looks through them for an 8th td with the proper span/class in it. If the XPath search fails it returns an empty list, so you can use the length to determine which rows have the td with the icon in them and which do not.
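The row-by-row length test described above can be tried offline on a toy table. What follows is a minimal sketch rather than the answer's original code: it assumes a current xml2, and the inline HTML, the column index td[2], and the names html_txt, icon_xpath, has_icon and icon_rows are all made up for illustration. The toy markup spells out tbody so that the path matches.

```r
library(xml2)

# Hypothetical stand-in for the real page: the icon sits in the
# 2nd column of rows 2 and 4.
html_txt <- '
<table>
  <tbody>
    <tr><td>A</td><td>CF</td></tr>
    <tr><td>B</td><td>RW <span class="kapitaenicon-table icons_sprite"></span></td></tr>
    <tr><td>C</td><td>CF</td></tr>
    <tr><td>D</td><td>RW <span class="kapitaenicon-table icons_sprite"></span></td></tr>
  </tbody>
</table>'

doc  <- read_html(html_txt)
rows <- xml_find_all(doc, "//table[1]/tbody/tr")

# Relative XPath, evaluated once per row; an empty result means "no icon".
icon_xpath <- "./td[2]/span[@class='kapitaenicon-table icons_sprite']"
has_icon <- vapply(
  rows,
  function(row) length(xml_find_all(row, icon_xpath)) > 0,
  logical(1)
)

icon_rows <- which(has_icon)
print(icon_rows)
```

Because the test runs once per tr, the positions returned by which() are row numbers within the table body.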
This:
pg %>%
  html_nodes(xpath="//table[6]/tbody/tr/td[8]") %>%
  xmlSApply(xpathApply, "boolean(./span[@class='kapitaenicon-table icons_sprite'])") %>%
  which
also works and is a bit tighter (and faster). It uses the XPath boolean operation to test for existence. This is handier if you have no other operations to perform on the node(s).
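The boolean() idea can also be sketched with xml2's xml_find_lgl(), which evaluates an XPath expression that yields a logical value. This is an illustrative variant, not the code from the answer: the toy HTML, the column index td[2], and the names html_txt, icon_test, has_icon and icon_rows are assumptions. The class test uses the common contains(concat(...)) idiom instead of an exact string match, so it does not depend on the order of the class names.

```r
library(xml2)

# Hypothetical markup: rows 1 and 3 carry the icon, with the class
# names in different orders; row 2 has an unrelated icon.
html_txt <- '
<table>
  <tbody>
    <tr><td>A</td><td>CF <span class="icons_sprite kapitaenicon-table"></span></td></tr>
    <tr><td>B</td><td>RW <span class="other-icon icons_sprite"></span></td></tr>
    <tr><td>C</td><td>CF <span class="kapitaenicon-table icons_sprite"></span></td></tr>
  </tbody>
</table>'

doc  <- read_html(html_txt)
rows <- xml_find_all(doc, "//table[1]/tbody/tr")

# boolean() turns "is there such a span?" into TRUE/FALSE inside XPath.
icon_test <- paste0(
  "boolean(./td[2]/span[contains(concat(' ', normalize-space(@class), ' '), ",
  "' kapitaenicon-table ')])"
)

has_icon <- vapply(
  rows,
  function(row) xml_find_lgl(row, icon_test),
  logical(1)
)

icon_rows <- which(has_icon)
print(icon_rows)
```

Padding the class attribute with spaces before calling contains() keeps a class such as "kapitaenicon-table-large" from matching by accident.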
This is an xml2 version, though I have to believe there has to be a better way to do this in xml2:
library(xml2)
library(magrittr)
pg2 <- read_html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")
pg2 %>%
  xml_find_all("//table[6]/tbody/tr/td[8]") %>%
  as_list %>%
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which
UPDATE
For version 0.1.0.9000 of xml2 I had to do the following:
pg2 %>% xml_find_all("//table") %>%
  as_list %>%
  extract2(6) %>%
  xml_find_all("./tbody/tr/td[8]") %>%
  as_list %>%
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which
That should not be the case and I've filed a bug report.
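For readers on a later xml2 release: xml_find_one() was deprecated in favour of xml_find_first(), which returns a missing node when nothing matches, so the try() wrapper is no longer needed. The sketch below is a hedged illustration of that newer style, not part of the original answer. It assumes a recent xml2; the toy HTML, the column index td[2], and the names html_txt, is_missing and icon_rows are hypothetical.

```r
library(xml2)

# Hypothetical markup: row 2 has a single merged cell, so it has no
# 2nd td at all; only row 3 carries the icon.
html_txt <- '
<table>
  <tbody>
    <tr><td>A</td><td>CF</td></tr>
    <tr><td colspan="2">not in squad</td></tr>
    <tr><td>C</td><td>RW <span class="kapitaenicon-table icons_sprite"></span></td></tr>
  </tbody>
</table>'

doc  <- read_html(html_txt)
rows <- xml_find_all(doc, "//table[1]/tbody/tr")

# xml_find_first() on a single node yields an "xml_missing" object
# when the relative path has no match.
is_missing <- vapply(
  rows,
  function(row) inherits(xml_find_first(row, "./td[2]/span"), "xml_missing"),
  logical(1)
)

icon_rows <- which(!is_missing)
print(icon_rows)
```

Iterating over tr elements rather than over a pre-selected set of td cells keeps the result aligned with table rows even when some rows have fewer cells, as row 2 does here.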
Session info -------------------------------------------------------------------------
 setting  value
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, darwin13.4.0
 ui       RStudio (0.99.441)
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
Packages -----------------------------------------------------------------------------
 package    * version date       source
 curl       * 0.5     2015-02-01 CRAN (R 3.2.0)
 devtools   * 1.7.0   2015-01-17 CRAN (R 3.2.0)
 magrittr     1.5     2014-11-22 CRAN (R 3.2.0)
 Rcpp       * 0.11.5  2015-03-06 CRAN (R 3.2.0)
 rstudioapi * 0.3.1   2015-04-07 CRAN (R 3.2.0)
 xml2         0.1.0   2015-04-20 CRAN (R 3.2.0)