How to use R to download a file from a webpage when there is no specific file embedded on the page

Problem description

Is there any way to download a file from a website in R when the page does not expose a direct file URL that download.file() could use?

I have this URL:

https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=y&type=8&season=2016&month=0&season1=2016&ind=0

There is a link on the page that exports a CSV file, but when I right-click the export data hyperlink and copy the link address, I get the following script

javascript:__doPostBack('LeaderBoard1$cmdCSV','') 

instead of a URL that gives me access to the CSV file.

Is there any way to work around this?

Solution

You can use RSelenium for jobs like this. The script below works for me exactly as is, and it should work for you as well with the minor edits noted in the text. The solution uses two packages: RSelenium to automate Chrome, and here to build file paths relative to your project directory.

library(RSelenium)
library(here)

Here's the URL you provided:

url <- paste0(
  "https://www.fangraphs.com/leaders.aspx",
  "?pos=all",
  "&stats=bat",
  "&lg=all",
  "&qual=y",
  "&type=8",
  "&season=2016",
  "&month=0",
  "&season1=2016",
  "&ind=0"
)
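
If you want more than one season, the same URL can be assembled from a named vector of query parameters. This is only a sketch: build_leaders_url is a hypothetical helper that is not part of the script, and it assumes the pages for other seasons accept the same parameters as the 2016 page.

```r
# Hypothetical helper: builds the leaderboard URL from named query parameters
# so that the season can be varied. Parameter names are copied from the URL above.
build_leaders_url <- function(season = 2016) {
  params <- c(
    pos = "all", stats = "bat", lg = "all", qual = "y", type = "8",
    season = season, month = "0", season1 = season, ind = "0"
  )
  # Join each name/value pair as name=value, separated by "&"
  query <- paste(names(params), params, sep = "=", collapse = "&")
  paste0("https://www.fangraphs.com/leaders.aspx?", query)
}

url_2016 <- build_leaders_url(2016)
```

For 2016 this produces the same string as the paste0() call above.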

Here's the ID of the download button. You can find it by right-clicking the button in Chrome and hitting "Inspect."

button_id <- "LeaderBoard1_cmdCSV"

We're going to automate Chrome to download the file, and it's going to go to your default download location. At the end of the script we'll want to move it to your current directory. So first let's set the name of the file (per fangraphs.com) and your download location (which you should edit as needed):

filename <- "FanGraphs Leaderboard.csv"
download_location <- file.path(Sys.getenv("USERPROFILE"), "Downloads")
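
The USERPROFILE environment variable is normally only set on Windows. If you are on macOS or Linux, something like the sketch below may work instead. default_download_dir is a hypothetical helper, and it assumes your browser saves to a "Downloads" folder under your home directory, which is a common default but not guaranteed.

```r
# Hypothetical helper: guess the browser's download folder on the current OS.
# Assumption: downloads go to "<home>/Downloads"; edit this if yours differ.
default_download_dir <- function() {
  home <- if (.Platform$OS.type == "windows") {
    Sys.getenv("USERPROFILE")
  } else {
    path.expand("~")
  }
  file.path(home, "Downloads")
}

download_location <- default_download_dir()
```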

Now you'll want to start a browser session. I use Chrome, and specifying this particular Chrome version (using the chromever argument) works for me. YMMV; check the best way to start a browser session for you.

An rsDriver object has two parts: a server and a browser client. Most of the magic happens in the browser client.

driver <- rsDriver(
  browser = "chrome",
  chromever = "74.0.3729.6"
)
server <- driver$server
browser <- driver$client

Using the browser client, navigate to the page and click that button.

Quick note before you do: RSelenium may start looking for the button and trying to click it before there's anything to click. So I added a few lines to watch for the button to show up, and then click it once it's there.

buttons <- list()
browser$navigate(url)
while (length(buttons) == 0) {
  buttons <- browser$findElements(button_id, using = "id")
}
buttons[[1]]$clickElement()
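
The loop above polls as fast as it can and never gives up if the button is missing. One possible variation is a small polling helper with a pause and a timeout. wait_for is a hypothetical name, and the 30 second default is an arbitrary assumption.

```r
# Hypothetical helper: call `condition` repeatedly until it returns something
# of non-zero length, or stop with an error once `timeout` seconds have passed.
wait_for <- function(condition, timeout = 30, interval = 0.25) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- condition()
    if (length(result) > 0) return(result)
    if (Sys.time() >= deadline) {
      stop("Timed out after ", timeout, " seconds")
    }
    Sys.sleep(interval)  # pause between attempts instead of spinning
  }
}

# Intended use with the objects defined above (needs a live browser session):
# browser$navigate(url)
# buttons <- wait_for(function() browser$findElements(button_id, using = "id"))
# buttons[[1]]$clickElement()
```

Because findElements returns an empty list when nothing matches, the helper keeps retrying until the list is non-empty.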

Then wait for the file to show up in your downloads folder, and move it to the current project directory:

while (!file.exists(file.path(download_location, filename))) {
  Sys.sleep(0.1)
}
file.rename(file.path(download_location, filename), here(filename))
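
One caveat: on most platforms file.rename will not move a file from one file system (or drive) to another, so it can fail if your downloads folder and your project live in different places. A hedged sketch of a fallback that copies and then deletes is below; move_file is a hypothetical helper, not part of base R.

```r
# Hypothetical helper: try a plain rename first, and fall back to
# copy-then-delete if the rename fails (for example across drives).
move_file <- function(from, to) {
  moved <- suppressWarnings(file.rename(from, to))
  if (!moved) {
    moved <- file.copy(from, to, overwrite = TRUE) && file.remove(from)
  }
  invisible(moved)  # TRUE if the file ended up at `to`
}

# Intended use with the objects defined above:
# move_file(file.path(download_location, filename), here(filename))
```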

Lastly, always clean up your server and browser client, or RSelenium gets quirky with you.

browser$close()
server$stop()

And you're on your merry way!
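
If something fails halfway through the script, the cleanup lines never run. One way to guard against that is to wrap the work in a function that registers the cleanup with on.exit. This is a sketch under that assumption; with_browser_cleanup is a hypothetical helper that expects the driver object returned by rsDriver.

```r
# Hypothetical helper: run `task` with the browser client and always close
# the client and stop the server afterwards, even if `task` throws an error.
with_browser_cleanup <- function(driver, task) {
  on.exit({
    try(driver$client$close(), silent = TRUE)
    try(driver$server$stop(), silent = TRUE)
  }, add = TRUE)
  task(driver$client)
}

# Intended use (needs RSelenium and a working Chrome driver):
# driver <- rsDriver(browser = "chrome", chromever = "74.0.3729.6")
# with_browser_cleanup(driver, function(browser) {
#   browser$navigate(url)
# })
```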


Note that you won't always have an element ID to use, and that's OK. IDs are great because they uniquely identify an element and using them requires almost no knowledge of website language. But if you don't have an ID to use, above where I specify using = "id", you have a lot of other options:

  • using = "xpath"
  • using = "css selector"
  • using = "name"
  • using = "tag name"
  • using = "class name"
  • using = "link text"
  • using = "partial link text"

Those give you a ton of alternatives and really allow you to identify anything on the page. findElements will always return a list. If there's nothing to find, that list will be of length zero. If it finds multiple elements, you'll get all of them.
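
Since an empty list means "not found," you can try several locators in turn and keep the first one that matches. The sketch below assumes that pattern; find_first is a hypothetical helper, and the CSS selector in the usage comment is just the ID from earlier rewritten with a leading "#".

```r
# Hypothetical helper: try each locator in order and return the first
# non-empty result. `browser` is an RSelenium client; each locator is a
# list with a `using` strategy and a `value` to search for.
find_first <- function(browser, locators) {
  for (locator in locators) {
    found <- browser$findElements(using = locator$using, value = locator$value)
    if (length(found) > 0) return(found)
  }
  list()  # same "nothing found" shape that findElements returns
}

# Intended use (needs a live browser session):
# buttons <- find_first(browser, list(
#   list(using = "id", value = "LeaderBoard1_cmdCSV"),
#   list(using = "css selector", value = "#LeaderBoard1_cmdCSV")
# ))
```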

XPath and CSS selectors in particular are super versatile. And you can find them without really knowing what you're doing. Let's walk through an example with the "Sign In" button on that page, which in fact does not have an ID.

Start in Chrome by pressing Control+Shift+J to open the Developer Console. In the upper left corner of the panel that shows up is a little icon for selecting elements.

Click that, and then click on the element you want.

That'll pull it up (highlight it) over in the "Elements" panel. Right-click the highlighted line and click "Copy selector." You can also click "Copy XPath," if you want to use XPath.

And that gives you your code!

buttons <- browser$findElements(
  "#linkAccount > div > div.label-account",
  using = "css selector"
)
buttons[[1]]$clickElement()

Boom.
