Extracting an HTML table from a website in R

Views: 214 | Posted: 2018/7/6 16:44:26 | Tags: r, html-table, rvest


Question

Hi, I am trying to extract a table from the Premier League website.

I am using the rvest package, and the code I am using in the initial phase is as follows:

library(rvest)
library(magrittr)
premierleague <- read_html("https://fantasy.premierleague.com/a/entry/767830/history")
premierleague %>% html_nodes("ism-table")

I couldn't find an HTML tag that works with html_nodes() from the rvest package.

I used a similar approach to extract data from http://admissions.calpoly.edu/prospective/profile.html, and there it worked. The code I used for the Cal Poly page is as follows:

library(rvest)
library(magrittr)
CPadmissions <- read_html("http://admissions.calpoly.edu/prospective/profile.html")

CPadmissions %>% html_nodes("table") %>%
  .[[1]] %>%
  html_table()

I got the code above from YouTube through this link: https://www.youtube.com/watch?v=gSbuwYdNYLM&ab_channel=EvanO%27Brien

Any help on getting data from fantasy.premierleague.com is highly appreciated. Do I need to use some kind of API?

Recommended answer

Since the data is loaded with JavaScript, grabbing the HTML with rvest will not get you what you want. If you use PhantomJS as a headless browser within RSelenium, though, it's not all that complicated (by RSelenium standards):

library(RSelenium)
library(rvest)

# initialize browser and driver with RSelenium
# (phantom() is defunct in newer RSelenium releases, which point to wdman::phantomjs() instead)
ptm <- phantom()
rd <- remoteDriver(browserName = 'phantomjs')
rd$open()

# grab source for page
rd$navigate('https://fantasy.premierleague.com/a/entry/767830/history')
html <- rd$getPageSource()[[1]]

# clean up
rd$close()
ptm$stop()

# parse with rvest
df <- html %>% read_html() %>% 
    html_node('#ismr-event-history table.ism-table') %>% 
    html_table() %>% 
    setNames(gsub('\\S+\\s+(\\S+)', '\\1', names(.))) %>%    # clean column names
    setNames(gsub('\\s', '_', names(.)))

str(df)
## 'data.frame':    20 obs. of  10 variables:
##  $ Gameweek                : chr  "GW1" "GW2" "GW3" "GW4" ...
##  $ Gameweek_Points         : int  34 47 53 51 66 66 65 63 48 90 ...
##  $ Points_Bench            : int  1 6 9 7 14 2 9 3 8 2 ...
##  $ Gameweek_Rank           : chr  "2,406,373" "2,659,789" "541,258" "905,524" ...
##  $ Transfers_Made          : int  0 0 2 0 3 2 2 0 2 0 ...
##  $ Transfers_Cost          : int  0 0 0 0 4 4 4 0 0 0 ...
##  $ Overall_Points          : chr  "34" "81" "134" "185" ...
##  $ Overall_Rank            : chr  "2,406,373" "2,448,674" "1,914,025" "1,461,665" ...
##  $ Value                   : chr  "£100.0" "£100.0" "£99.9" "£100.0" ...
##  $ Change_Previous_Gameweek: logi  NA NA NA NA NA NA ...
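
The parsing step can be tried offline, without a browser. The sketch below is illustrative only: the inline HTML is a hypothetical stand-in for the rendered page source (the real markup and header text may differ), and it reuses the selector and the column-name cleanup from the answer. It also shows why the selector in the question matches nothing: a bare "ism-table" is read as an element name, while ".ism-table" selects by class.

```r
library(rvest)

# Hypothetical stand-in for the rendered page source; the real markup may differ.
html <- '
<div id="ismr-event-history">
  <table class="ism-table">
    <tr><th>GW Gameweek</th><th>GP Gameweek Points</th></tr>
    <tr><td>GW1</td><td>34</td></tr>
    <tr><td>GW2</td><td>47</td></tr>
  </table>
</div>'

page <- read_html(html)

# "ism-table" is treated as a tag name (<ism-table>), so no nodes are found
no_match <- page %>% html_nodes("ism-table")
length(no_match)

# "table.ism-table" selects <table> elements with that class; the
# "#ismr-event-history" prefix restricts the search to that element's descendants
df <- page %>%
  html_node("#ismr-event-history table.ism-table") %>%
  html_table() %>%
  setNames(gsub("\\S+\\s+(\\S+)", "\\1", names(.))) %>%  # drop the leading abbreviation
  setNames(gsub("\\s", "_", names(.)))                   # spaces to underscores

names(df)
```

On the live site the same selector only works on the source returned by the headless browser, since the table is not present in the HTML that read_html() downloads directly.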

As always, more cleaning is necessary, but overall it's in pretty good shape without too much work. (If you're using the tidyverse, df %>% mutate_if(is.character, parse_number) will do pretty well.) The arrows are images, which is why the last column is all NA, but you can calculate those anyway.
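
A minimal sketch of that cleanup, assuming dplyr and readr are installed. The three-row data frame is a hypothetical sample built from values shown in the str(df) output above, and the Rank_Change column is only one guess at what the arrow images stand for (movement in overall rank since the previous gameweek).

```r
library(dplyr)
library(readr)

# Hypothetical three-row sample in the shape of the scraped table
df <- data.frame(
  Gameweek        = c("GW1", "GW2", "GW3"),
  Gameweek_Points = c(34L, 47L, 53L),
  Overall_Rank    = c("2,406,373", "2,448,674", "1,914,025"),
  Value           = c("£100.0", "£100.0", "£99.9"),
  stringsAsFactors = FALSE
)

clean <- df %>%
  # parse_number() drops non-numeric text around the first number and
  # treats "," as a grouping mark by default
  mutate_if(is.character, parse_number) %>%
  # assumption: the arrows reflect the change in overall rank between
  # consecutive gameweeks (a negative value means the rank improved)
  mutate(Rank_Change = Overall_Rank - lag(Overall_Rank))

str(clean)
```

Note that this also turns Gameweek into a plain number ("GW1" becomes 1). mutate_if() still works in current dplyr, although across() is the newer spelling.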
