How to filter out nodes with rvest?

Question

I am using the R rvest library to read an HTML page containing tables. Unfortunately, the tables have an inconsistent number of columns.

Here is an example of the table I read:

<table>
    <tr class="alt">
        <td>1</td>
        <td>2</td>
        <td class="hidden">3</td>
    </tr>
    <tr class="tr0 close notule">
        <td colspan="9">4</td>
    </tr>
</table>

And my code to read the table in R:

library(rvest)

url <- "table.html"
x <- read_html(url)
x %>% html_nodes("table") %>% html_table(fill = TRUE)
# [[1]]
#   X1 X2 X3 X4 X5 X6 X7 X8 X9
# 1  1  2  3 NA NA NA NA NA NA
# 2  4  4  4  4  4  4  4  4  4

I would like to ignore the td of class hidden and the tr of class 'tr0 close notule', so that I only get a table as follows:

  X1 X2
   1  2

Is there a way to do that with rvest?

Recommended answer

By using xml_remove() (from the xml2 package), you can literally remove those nodes. It modifies the parsed document in place, so no reassignment is needed before calling html_table():

library(rvest)
library(xml2)  # provides xml_remove()

text <- '<table>
    <tr class="alt">
        <td>1</td>
        <td>2</td>
        <td class="hidden">3</td>
    </tr>
    <tr class="tr0 close notule">
        <td colspan="9">4</td>
    </tr>
</table>'

html_tree <- read_html(text)

# Select the nodes you want to remove
hidden_nodes <- html_tree %>%
    html_nodes(".hidden")
close_nodes <- html_tree %>%
    html_nodes(".tr0.close.notule")

# Remove those nodes from the document (modifies html_tree in place)
xml_remove(hidden_nodes)
xml_remove(close_nodes)

# Parse what is left of the table
html_tree %>%
    html_table()
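
If you would rather leave the parsed document untouched, a non-destructive variant is to select only the rows and cells you want with the CSS :not() pseudo-class and assemble the data frame yourself. The following is a minimal sketch under that assumption, not part of the original answer. The names keep_rows, cells and result are illustrative, and the sketch assumes every kept row ends up with the same number of cells.

```r
library(rvest)

text <- '<table>
    <tr class="alt">
        <td>1</td>
        <td>2</td>
        <td class="hidden">3</td>
    </tr>
    <tr class="tr0 close notule">
        <td colspan="9">4</td>
    </tr>
</table>'

html_tree <- read_html(text)

# Keep only the rows that do not carry the "notule" class
keep_rows <- html_nodes(html_tree, "tr:not(.notule)")

# For each kept row, collect the text of the cells that are not hidden
cells <- lapply(keep_rows, function(row) {
    html_text(html_nodes(row, "td:not(.hidden)"))
})

# Bind the rows into a data frame (assumes equal cell counts per row)
result <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
result
```

Note that the cell values come back as character strings here, whereas html_table() also tries to convert column types, so a conversion step may be needed afterwards.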

I examined the page and found that the class attributes of those rows follow the pattern tr0 close notule, tr1 close notule, ..., tr{n} close notule, so you need a more general selector to match all of them:

library(purrr)
library(rvest)
library(xml2)  # provides xml_remove()

# html_session() is deprecated in rvest >= 1.0.0; session() is its replacement
my_session <- html_session("https://www.zone-turf.fr/cheval/greaty-lion-592958/")

# Select all table nodes
tables <- my_session %>%
    html_nodes(".inner2 > table")

# Remove every node that has both the "close" and "notule" classes
close_nodes <- tables %>%
    html_nodes(".close.notule")
xml_remove(close_nodes)

# Use map() to convert each table to a data frame and collect them in a list
map(tables, ~ .x %>% html_table())
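
To check the broader selector without hitting the website, here is a small offline sketch. The HTML below is made up for illustration (it is not taken from zone-turf.fr) and only mimics rows whose class runs tr0 close notule, tr1 close notule, and so on. The selector .close.notule matches any element carrying both classes, whatever the tr{n} part is. Base lapply() is used in place of purrr::map() to keep the example free of extra dependencies; n_removed and result are illustrative names.

```r
library(rvest)
library(xml2)  # provides xml_remove()

# Made-up markup: two tables, each with a header, a data row and a note row
page <- read_html('
<div class="inner2">
  <table>
    <tr><th>Rank</th><th>Horse</th></tr>
    <tr class="tr0"><td>1</td><td>A</td></tr>
    <tr class="tr0 close notule"><td colspan="2">note</td></tr>
  </table>
  <table>
    <tr><th>Rank</th><th>Horse</th></tr>
    <tr class="tr1"><td>2</td><td>B</td></tr>
    <tr class="tr1 close notule"><td colspan="2">note</td></tr>
  </table>
</div>')

tables <- html_nodes(page, ".inner2 > table")

# Matches both "tr0 close notule" and "tr1 close notule"
close_nodes <- html_nodes(tables, ".close.notule")
n_removed <- length(close_nodes)
xml_remove(close_nodes)

# One data frame per table, with the note rows gone
result <- lapply(tables, html_table)
result
```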

After adjusting the CSS selector for the nodes to remove, it should work:

#sample output --------------
[[8]]
RangRg        Cheval S/A CordeC PoidsPds      Jockey Cote   Ecart   
1      1        Latita  F2      3     57,5  T. Piccone  6.3 1'56"05 NA
2      2 Youmzain Star  M2      6       59    F. Veron  4.7     3/4 NA
3      3      Pharrell  M2      1       59 J.B. Eyquem  1.9       1 NA
4      4   King Bubble  M2      4       58   N. Perret 15.5       1 NA
5      5     Dark Side  M2      5       57  A. Hamelin 12.4       8 NA
6      6   Greaty Lion  F2      2     57,5  F. Blondel  6.8      15 NA

[[9]]
  RangRg             Cheval S/A CordeC PoidsPds      Jockey Cote   Ecart     
1      1        Marianafoot  M2      4       59   N. Perret  2.1 1'40"25 lire
2      2 Ballet de la Reine  F2      2       54 H. Journiac  3.4   1 1/2     
3      3        Greaty Lion  F2      5     57,5  F. Blondel  7.0       2     
4      4      Beau Massagot  M2      6       54  E. Cieslik  9.7       5     
5      5        London Look  M2      3       58  T. Piccone  8.8     5,5     
6      6       Spirit Louve  F2      1       53   L. Grosso 18.8    Tête     

[[10]]
   RangRg          Cheval S/A CordeC PoidsPds        Jockey Cote    Ecart   
1       1     Greaty Lion  F2     12       58    F. Blondel  3.6  1'43"84 NA
2       2         Maeghan  F2     11       58     G. Millet  5.1        1 NA
3       3 Neige Eternelle  F2      8       58    A. Roussel  3.8    1 3/4 NA
4       4    Fair la Joie  F2      9       58     G. Congiu 11.6      1/4 NA
5       5     Nicky Green  F2      7       58     R. Thomas  6.4      1/4 NA
6       6   Coral Slipper  F2      5       58  A. Fouassier 28.4    1 1/4 NA
7       7  Gaia de Cerisy  F2     13       58      D. Breux 32.5        1 NA
8       8      Luna Riska  F2      1       58 N. Larenaudie 58.3 Encolure NA
9       9   Belle Vendome  F2      2       56  A. Teissieux 49.9    2 1/2 NA
10      0     Rebel Dream  F2      3       58      S. Leger 56.6          NA
11      0      Facinateur  F2      4       56      M. Berto 21.2          NA
12      0        Giovanna  F2     10       56    F. Garnier 27.8          NA
