Using R with tidyquant and massive data

Problem description

While working with R I encountered a strange problem. I am processing data in the following manner: reading data from a database into a data frame, filling missing values, grouping and nesting the data by a combined primary key, creating a time series and forecasting it for every group, ungrouping and cleaning the data, and writing it back into the DB.

Similar to this: https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html

For small data sets this works like a charm, but with larger ones (over about 100,000 entries) I get the "R Session Aborted" screen from RStudio, and the native R GUI just stops execution and implodes. There is no information in any of the log files I have looked into. I suspect that it is some kind of memory issue, possibly a leak.

As a workaround I'm processing the data in chunks with a for loop. But no matter how small the chunk size is, I still get the "R Session Aborted" screen, which looks a lot like leaking memory. The whole data set consists of about 5 million rows.

I've looked a lot into packages like ff, the big-family, and matter, basically everything from https://cran.r-project.org/web/views/HighPerformanceComputing.html, but these do not seem to work well with tibbles and the tidyverse way of data processing.

So, how can I improve my script to work with massive amounts of data? How can I gather clues about why the R session is aborted?

Recommended answer

Check out the article at datascience.la/dplyr-and-a-very-basic-benchmark

There is a table that shows runtime comparisons for some of the data wrangling tasks you are performing. From the table, it looks as though dplyr with a data.table behind it is likely to do much better than dplyr with a data frame behind it.

There's a link to the benchmarking code used to make the table, too.

In short, try adding a key, and try using a data.table instead of a data frame.

To make x your key, assuming your data.table is named dt, use setkey(dt, x).
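
As a rough illustration of that suggestion, here is a minimal sketch. It assumes the data.table package is installed; the column names (x, value) and the generated data are made up for the example and are not taken from the question.

```r
# Assumes the data.table package is installed: install.packages("data.table")
library(data.table)

# Hypothetical example data: 1,000 series ids with 100 observations each.
set.seed(1)
dt <- data.table(
  x     = rep(sprintf("id%04d", 1:1000), each = 100),
  value = rnorm(100000)
)

# Sort the table by x in place and mark x as its key (no copy is made).
setkey(dt, x)

# Keyed subset: rows for one id are located by binary search on the key.
one_series <- dt["id0042"]

# Grouped aggregation by the key column; the result has one row per id.
per_group <- dt[, .(n = .N, mean_value = mean(value)), by = x]
```

An existing data frame or tibble can be converted with as.data.table(df) before setting the key. Whether this avoids the aborted session depends on where the memory is actually going, so treat it as something to try rather than a guaranteed fix.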
