Scraping html table with span using rvest

Views: 126 | Published: 2018/7/6 16:51:08 | Tags: r web-scraping html-table rvest


Problem description

I'm using rvest to extract the table on the following page:

https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin

The following code works:

library(rvest)

URL <- 'https://en.wikipedia.org/wiki/List_of_United_States_presidential_elections_by_popular_vote_margin'
table <- URL %>%
  read_html() %>%
  html_nodes("table") %>%
  .[[2]] %>%                # keep the second table on the page
  html_table(trim = TRUE)

However, the margin columns and the president name column contain some strange values. The reason is that the page source contains cells like the following:

<td><span style="display:none">00.001</span>−10.44%</td>

So instead of getting −10.44%, I get 00.001âˆ’10.44%.

How could I fix this?

Recommended answer

One option is to target and replace the problem columns individually.

This can be done with xpath:

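# get the html (URL as defined in the question)
html <- URL %>%
  read_html()

# Example using the first margin column (column # 6)
html %>%
  html_nodes(xpath = '//table[2]') %>%        # get table 2
  html_nodes(xpath = './/td[6]/text()') %>%   # column 6 via text(), which skips the hidden span;
                                              # the leading '.' keeps the search inside table 2
  html_text() %>%                             # extract the text as a character vector
  iconv("UTF-8", "UTF-8")                     # to convert "âˆ’" to "-"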
# [1] "−10.44%" "−3.00%"  "−0.83%"  "−0.51%"  "0.09%"   "0.17%"   "0.57%"  
# [8] "0.70%"   "1.45%"   "2.06%"   "2.46%"   "3.01%"   "3.12%"   "3.86%"  
#[15] "4.31%"   "4.48%"   "4.79%"   "5.32%"   "5.56%"   "6.05%"   "6.12%"  
#[22] "6.95%"   "7.27%"   "7.50%"   "7.72%"   "8.51%"   "8.53%"   "9.74%"  
#[29] "9.96%"   "10.08%"  "10.13%"  "10.85%"  "11.80%"  "12.20%"  "12.25%" 
#[36] "14.20%"  "14.44%"  "15.40%"  "17.41%"  "17.76%"  "17.81%"  "18.21%" 
#[43] "18.83%"  "22.58%"  "23.15%"  "24.26%"  "25.22%"  "26.17%"
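A side note on XPath in chained `html_nodes()` calls like this one: an expression that starts with `//` is evaluated from the document root, while one that starts with `.//` is evaluated relative to the nodes already selected. The following is a minimal sketch on a hypothetical two-table document (the inline HTML and the variable names are made up for illustration, not taken from the Wikipedia page):

```r
library(rvest)

# Hypothetical two-table document, built inline for illustration
doc <- read_html('
<table><tr><td>a1</td><td>a2</td></tr></table>
<table><tr><td>b1</td><td>b2</td></tr></table>')

# Select the second table in the document
second <- html_nodes(doc, xpath = "(//table)[2]")

# "//td[2]" restarts from the document root, so cells from both tables match
from_root <- html_text(html_nodes(second, xpath = "//td[2]"))

# ".//td[2]" stays inside the table that was selected
relative <- html_text(html_nodes(second, xpath = ".//td[2]"))

print(from_root)
print(relative)
```

If a page has more than one table with enough columns, the relative form is the one that restricts the result to the table you picked.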

Do the same for the other margin column. I used iconv to convert the âˆ’ to -, as it's an encoding issue, but you could use a substitution-based solution instead (e.g. using sub).
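The substitution-based route could look like the following minimal sketch. It assumes the character in the scraped text is the Unicode minus sign (U+2212), which is what the cell in the question appears to contain. The `margins` vector is a hypothetical sample shaped like the margin column, and the conversion to numeric is an optional extra step that the original answer does not perform.

```r
# Hypothetical sample values shaped like the scraped margin column
margins <- c("\u221210.44%", "\u22123.00%", "0.09%", "26.17%")

# Replace the Unicode minus sign (U+2212) with an ASCII hyphen-minus
ascii_margins <- sub("\u2212", "-", margins, fixed = TRUE)

# Optional: drop the percent sign and convert to numeric
margin_values <- as.numeric(sub("%", "", ascii_margins, fixed = TRUE))

print(ascii_margins)
print(margin_values)
```

`sub()` replaces only the first match in each element, which is enough here because each value has at most one sign. Use `gsub()` if a value could contain several.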

To target the column with president names, you can use xpath again:

html %>%
  html_nodes(xpath = '//table[2]') %>%           # get table 2
  html_nodes(xpath = './/td[3]/a/text()') %>%    # link text in column 3; the leading '.' keeps the search inside table 2
  html_text()
# [1] "John Quincy Adams"      "Rutherford Hayes"       "Benjamin Harrison"     
# [4] "George W. Bush"         "James Garfield"         "John Kennedy"          
# [7] "Grover Cleveland"       "Richard Nixon"          "James Polk"            
#[10] "Jimmy Carter"           "George W. Bush"         "Grover Cleveland"      
#[13] "Woodrow Wilson"         "Barack Obama"           "William McKinley"      
#[16] "Harry Truman"           "Zachary Taylor"         "Ulysses Grant"         
#[19] "Bill Clinton"           "William Henry Harrison" "William McKinley"      
#[22] "Franklin Pierce"        "Barack Obama"           "Franklin Roosevelt"    
#[25] "George H. W. Bush"      "Bill Clinton"           "William Taft"          
#[28] "Ronald Reagan"          "Franklin Roosevelt"     "Abraham Lincoln"       
#[31] "Abraham Lincoln"        "Dwight Eisenhower"      "Ulysses Grant"         
#[34] "James Buchanan"         "Andrew Jackson"         "Martin Van Buren"      
#[37] "Woodrow Wilson"         "Dwight Eisenhower"      "Herbert Hoover"        
#[40] "Franklin Roosevelt"     "Andrew Jackson"         "Ronald Reagan"         
#[43] "Theodore Roosevelt"     "Lyndon Johnson"         "Richard Nixon"         
#[46] "Franklin Roosevelt"     "Calvin Coolidge"        "Warren Harding"
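A different approach, not part of the original answer, is to delete the hidden sort-key spans from the parsed document before calling `html_table()`, so that every column comes out clean in one pass. The sketch below is illustrative only. The inline HTML is a hypothetical miniature of the Wikipedia table, and the `contains(@style, 'display:none')` test assumes the spans are hidden with exactly that inline style, as in the cell shown in the question.

```r
library(rvest)
library(xml2)

# Hypothetical miniature of the table, built inline for illustration
page <- read_html('
<table>
  <tr><th>Year</th><th>Margin</th></tr>
  <tr><td>1824</td><td><span style="display:none">00.001</span>&minus;10.44%</td></tr>
  <tr><td>1960</td><td><span style="display:none">00.006</span>0.17%</td></tr>
</table>')

# Find the hidden sort-key spans and remove them from the document in place
hidden <- html_nodes(page, xpath = "//span[contains(@style, 'display:none')]")
xml_remove(hidden)

# Parse the cleaned table as usual
result <- html_table(html_node(page, "table"), trim = TRUE)
print(result$Margin)
```

Note that `xml_remove()` modifies `page` itself, so re-read the page if you also need the untouched markup.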
