R memory issues while webscraping with rvest

Question

I am using rvest to webscrape in R, and I'm running into memory issues. I have a 28,625 by 2 data frame of strings called urls that contains the links to the pages I'm scraping. A row of the frame contains two related links. I want to generate a 28,625 by 4 data frame Final with information scraped from the links. One piece of information is from the second link in a row, and the other three are from the first link. The xpaths to the three pieces of information are stored as strings in the vector xpaths. I am doing this with the following code:

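library(rvest)

# Preallocate one flat character vector: 4 values per row of urls
data <- rep("", 4 * 28625)
k <- 1

for (i in 1:28625) {
  # Second link: table inside #seriesDiv; keep the cell at row 4, column 3
  name <- html(urls[i, 2]) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table(fill = TRUE)
  data[k] <- name[4, 3]

  # First link: text of the three nodes matched by xpaths
  data[k + 1:3] <- html(urls[i, 1]) %>%
    html_nodes(xpath = xpaths) %>%
    html_text()

  k <- k + 4
}

# Reshape to 4 x 28625, then transpose into a 28625 x 4 data frame
dim(data) <- c(4, 28625)
Final <- as.data.frame(t(data))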

It works well enough, but when I open the task manager, I see that my memory usage has been monotonically increasing and is currently at 97% after about 340 iterations. I'd like to just start the program and come back in a day or two, but all of my RAM will be exhausted before the job is done. I've done a bit of research on how R allocates memory, and I've tried my best to preallocate memory and modify in place, to keep the code from making unnecessary copies of things, etc.

Why is this so memory intensive? Is there anything I can do to resolve it?

Recommended answer

Rvest has been updated to resolve this issue. See here:

http://www.r-bloggers.com/rvest-0-3-0/
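
For illustration, here is a minimal sketch of one loop iteration written against rvest 0.3.0 or later, where read_html() replaces the deprecated html() and parsing goes through the xml2 package. The inline HTML strings, the three XPaths, and the helper name scrape_row are hypothetical stand-ins, since the question does not show the real pages or the contents of xpaths. The sketch also looks up each XPath separately with html_node(), on the assumption that newer versions may not accept a vector of XPaths in a single html_nodes() call.

```r
library(rvest)  # assumes rvest >= 0.3.0, which parses via xml2

# Hypothetical stand-ins for the two pages linked from one row of `urls`.
page1_html <- '<html><body>
  <h1 id="title">Alpha</h1><p id="kind">Beta</p><span id="date">Gamma</span>
</body></html>'
page2_html <- '<html><body><div id="seriesDiv"><table>
  <tr><td>a1</td><td>a2</td><td>a3</td></tr>
  <tr><td>b1</td><td>b2</td><td>b3</td></tr>
  <tr><td>c1</td><td>c2</td><td>c3</td></tr>
  <tr><td>d1</td><td>d2</td><td>d3</td></tr>
</table></div></body></html>'

# Hypothetical XPaths; the question's real `xpaths` vector is not shown.
xpaths <- c('//*[@id="title"]', '//*[@id="kind"]', '//*[@id="date"]')

# Hypothetical helper: returns the four values for one row.
# `src1` and `src2` may be URLs, file paths, or literal HTML strings.
scrape_row <- function(src1, src2, xpaths) {
  tbl <- read_html(src2) %>%
    html_node(xpath = '//*[@id="seriesDiv"]/table') %>%
    html_table()
  # Row 4 of column 3; [[ ]] indexing works for data frames and tibbles alike
  from_table <- as.character(tbl[[3]][4])

  page1 <- read_html(src1)
  # One lookup per XPath keeps the results in the same order as `xpaths`
  from_page1 <- vapply(
    xpaths,
    function(xp) html_text(html_node(page1, xpath = xp)),
    character(1),
    USE.NAMES = FALSE
  )
  c(from_table, from_page1)
}

row <- scrape_row(page1_html, page2_html, xpaths)
print(row)
```

In the original loop, the same helper could be called as scrape_row(urls[i, 1], urls[i, 2], xpaths) and its result stored in data[k + 0:3]. An XPath that matches nothing yields NA for that slot rather than shifting the remaining values.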
