How to include attributes in a web-scraped html table

Question

I'm using rvest to scrape data from an internal website's HTML tables. The color of the rows is meaningful, so I want to extract the BGCOLOR attribute as a column in my final table, but of course html_table() only extracts the content.

Here's what I have so far. A snippet of the html table is below. How can I include a column for color?

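library(rvest)

# `samplepage` is the already-parsed page (presumably the result of read_html())
html_nodes(samplepage, "table")

# html_table() returns only the cell contents; row attributes such as BGCOLOR are dropped
tbl_content <- samplepage %>%
     html_nodes("table") %>%
     html_table(fill = TRUE, trim = TRUE)
tbl_content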


<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl     <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center">    51.5 <td align="center">    32
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl        <td>  12.2 <td>   1.7 <td align="center">   7.6 <td align="center">  14.9 <td align="center">    71 <td align="center">    33
</tr>

Recommended answer

You can build your own parser to replace html_table. purrr::map_df is handy for iterating over nodes (trs in this case) and combining the results into a data.frame:

library(rvest)
library(tidyverse)

html <- '<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl     <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center">    51.5 <td align="center">    32
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl        <td>  12.2 <td>   1.7 <td align="center">   7.6 <td align="center">  14.9 <td align="center">    71 <td align="center">    33
</tr>'

parsed_df <- html %>% 
    read_html() %>% 
    html_nodes('tr') %>% 
    # tibble() and as_tibble() replace the deprecated data_frame() and invoke()
    map_df(~bind_cols(tibble(bgcolor = html_attr(.x, 'bgcolor')),    # grab attribute
                      # extract each row's values to a 1-row tibble
                      html_nodes(.x, 'td') %>% 
                          html_text(trim = TRUE) %>% 
                          set_names(paste0('x', seq_along(.))) %>%    # name cells x1, x2, ...
                          as.list() %>% 
                          as_tibble())) %>% 
    type_convert()    # clean up types

parsed_df
#> # A tibble: 2 x 9
#>   bgcolor        x1     x2     x3    x4    x5    x6    x7    x8
#>     <chr>     <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 #F8C0E0 BASOPHILS microl  0.477 0.425 0.052  1.92  51.5    32
#> 2 #F8F0B0   CALCIUM  mg/dl 12.200 1.700 7.600 14.90  71.0    33
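
If the same row-by-row parsing is needed for several tables, it may help to wrap it in a function. The following is a minimal sketch rather than anything rvest provides: `table_with_attr` and `row_to_df` are hypothetical names, and the code assumes the rows live inside a `<table>` element and that `dplyr` is available for `bind_rows()`. All columns come back as character; `readr::type_convert()` can be applied afterwards as in the answer.

```r
library(rvest)
library(dplyr)

# Hypothetical helper, not part of rvest: parse one <table> node row by row,
# keeping the requested <tr> attribute as the first column.
table_with_attr <- function(table_node, attr = "bgcolor") {
    row_to_df <- function(row) {
        cells <- html_text(html_nodes(row, "td"), trim = TRUE)
        names(cells) <- paste0("x", seq_along(cells))
        fields <- c(list(html_attr(row, attr)), as.list(cells))
        names(fields)[1] <- attr
        as.data.frame(fields, stringsAsFactors = FALSE)
    }
    # bind_rows() pads rows that have fewer cells with NA
    bind_rows(lapply(html_nodes(table_node, "tr"), row_to_df))
}

# Sample rows from the question, wrapped in a <table> tag
html <- '<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl     <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center">    51.5 <td align="center">    32
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl        <td>  12.2 <td>   1.7 <td align="center">   7.6 <td align="center">  14.9 <td align="center">    71 <td align="center">    33
</tr>'

page <- read_html(paste("<table>", html, "</table>"))
result <- table_with_attr(html_node(page, "table"))
result
```

Passing a different `attr` value (for example `"class"`) would pull that attribute instead, which is the main reason to parameterize it.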

More simply but less flexibly, you can just pull out the attribute and then merge it to the results of html_table:

paste('<table>', html, '</table>') %>%    # `html_table` needs a <table> tag
    read_html() %>% 
    {
        data.frame(bgcolor = html_nodes(., 'tr') %>% html_attr('bgcolor'), 
                   html_table(.))
    }
#>   bgcolor        X1     X2     X3    X4    X5    X6   X7 X8
#> 1 #F8C0E0 BASOPHILS microl  0.477 0.425 0.052  1.92 51.5 32
#> 2 #F8F0B0   CALCIUM  mg/dl 12.200 1.700 7.600 14.90 71.0 33
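
One caveat with the merge approach is that the attribute vector and the table are aligned purely by position, so it only works when `html_table()` yields exactly one row per `<tr>`. A hedged sketch of the same idea for a single table node, with an explicit check added (the check and the variable names `tbl_node`, `colors`, `content`, and `combined` are illustrative additions, not part of the original answer):

```r
library(rvest)

# Sample rows from the question, wrapped in a <table> tag
html <- '<tr BGCOLOR = "#F8C0E0">
<td> BASOPHILS <td> microl     <td> 0.477 <td> 0.425 <td align="center"> 0.052 <td align="center"> 1.920 <td align="center">    51.5 <td align="center">    32
</tr>
<tr BGCOLOR = "#F8F0B0">
<td> CALCIUM <td > mg/dl        <td>  12.2 <td>   1.7 <td align="center">   7.6 <td align="center">  14.9 <td align="center">    71 <td align="center">    33
</tr>'

page <- read_html(paste("<table>", html, "</table>"))

# Work on one table node so html_table() returns a single table, not a list
tbl_node <- html_node(page, "table")
colors   <- html_attr(html_nodes(tbl_node, "tr"), "bgcolor")
content  <- html_table(tbl_node)

# Positional merge is only safe when every <tr> became exactly one row
stopifnot(length(colors) == nrow(content))

combined <- data.frame(bgcolor = colors, content, stringsAsFactors = FALSE)
combined
```

If the table has a header row that `html_table()` turns into column names, the counts will differ and the check fails; in that case the header row's attribute would need to be dropped from `colors` first.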
