Web scraping in R?

Question

I would like to scrape this website.

In particular, I would like to extract the information in its standings table.

Note that I selected a specific date in the upper right corner.

Following this guide, I wrote the following code:

library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'

webpage_nba <- read_html(url_nba)

# Use a CSS selector to select the standings table
data_nba <- html_nodes(webpage_nba, '#standings-table')

# Convert the selected node to plain text
data_nba <- html_text(data_nba)
write.csv(data_nba,"web scraping test.csv")

From my understanding, the numbers I want (e.g. for the Warriors: 94%, 79%, 66%, 59%) are "coded" in a different way. In other words, what gets written to web scraping test.csv is not readable.

Is there any way to transform the "coded numbers" into "regular numbers"?
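One possible reason the CSV looks unreadable is that html_text() flattens a whole table node into a single string, whereas html_table() keeps the rows and columns. The sketch below illustrates the difference on a small hand-written table rather than the live page. The HTML, the column names p_a and p_b, and the "Team B" row are made up for illustration, and it does not address content that the page only renders with JavaScript.

```r
library(rvest)

# Hypothetical stand-in for the real standings table (assumed markup)
html <- read_html('<table id="standings-table">
  <tr><th>team</th><th>p_a</th><th>p_b</th></tr>
  <tr><td>Warriors</td><td>94%</td><td>79%</td></tr>
  <tr><td>Team B</td><td>12%</td><td>5%</td></tr>
</table>')

node <- html_node(html, "#standings-table")

# html_text() concatenates every cell into one character string
flat <- html_text(node)

# html_table() parses the same node into a data frame, one row per <tr>
tbl <- html_table(node)
```

Here flat is a single string with all cell contents run together, while tbl has one row per team and keeps the percentages in separate columns.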

Solution

Thanks to @Alexey's answer and this, the following code worked for me:

library(RSelenium)
library(rvest)
library(wdman)

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'


#initiate RSelenium. If it doesn't work, try other browser engines
# rD <- rsDriver()
# remDr <- rD$client

pDrv <- phantomjs(port = 4567L)
remDr <- remoteDriver(browserName = "phantomjs", port = 4567L)
remDr$open()
#navigate to main page
remDr$navigate(url_nba)

#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()

# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
pDrv$stop()

# rD[["server"]]$stop() 


# Select one of the tables and get it to dataframe
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]

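# Clean up the data frame: use row 3 as the header, then drop the
# 3 leading header rows and the 4 trailing rows
names(df) <- df[3, ]
df <- tail(df, -3)
df <- head(df, -4)
# Drop columns whose header is missing. A logical index is used because
# -which(...) would drop every column when nothing matches.
keep <- !is.na(names(df)) & names(df) != "NA"
df <- df[, keep, drop = FALSE]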
df
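The resulting columns are still text such as "94%". A minimal sketch of turning them into numbers, using base R only, is shown below. The helper pct_to_num, the toy data frame, and the column names p_a and p_b are hypothetical; the real column names in df would need to be substituted, and entries that are not a plain number followed by "%" (for example "<1%", if the page uses such labels) are deliberately left as NA here.

```r
# Convert strings such as "94%" to proportions such as 0.94.
# Anything that is not a plain number followed by "%" becomes NA.
pct_to_num <- function(x) {
  x <- trimws(as.character(x))
  ok <- grepl("^[0-9]+(\\.[0-9]+)?%$", x)
  out <- rep(NA_real_, length(x))
  out[ok] <- as.numeric(sub("%$", "", x[ok])) / 100
  out
}

# Hypothetical stand-in for the cleaned data frame `df`
toy <- data.frame(
  team = c("Warriors", "Team B"),
  p_a  = c("94%", "<1%"),
  p_b  = c("79%", "12.5%"),
  stringsAsFactors = FALSE
)

# Apply the conversion to the percentage columns only
pct_cols <- c("p_a", "p_b")
toy[pct_cols] <- lapply(toy[pct_cols], pct_to_num)
```

After this, toy$p_a and toy$p_b are numeric, so they can be sorted, summed, or written to CSV as ordinary numbers.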
