What are helpful optimizations in R for big data sets?


Problem description

I built a script that works great with small data sets (<1 M rows) and performs very poorly with large datasets. I've heard of data.table as being more performant than tibbles. I'm interested to know about other speed optimizations in addition to learning about data.tables.

I'll share a couple of commands in the script for examples. In each of the examples, the datasets are 10 to 15 million rows and 10 to 15 columns.

  1. Getting the minimum date of a dataframe grouped by nine variables

      dataframe %>% 
      group_by(key_a, key_b, key_c,
               key_d, key_e, key_f,
               key_g, key_h, key_i) %>%
      summarize(min_date = min(date)) %>% 
      ungroup()

  2. Doing a left join of two dataframes to add additional columns

      merge(dataframe, 
          dataframe_two, 
          by = c("key_a", "key_b", "key_c",
               "key_d", "key_e", "key_f",
               "key_g", "key_h", "key_i"),
          all.x = T) %>% 
      as_tibble()

  3. Joining two dataframes on the closest date

      dataframe %>%
      left_join(dataframe_two, 
                  by = "key_a") %>%
      group_by(key_a, date.x) %>%
      summarise(key_z = key_z[which.min(abs(date.x - date.y))]) %>%
      arrange(date.x) %>%
      rename(day = date.x)

What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?

--

Here is a sample dataset

set.seed(1010)
library("conflicted")
library("dplyr")      # provides %>% and mutate()
library("lubridate")  # provides ymd() and days()
conflict_prefer("days", "lubridate")
bigint <- rep(
  sample(1238794320934:19082323109, 1*10^7)
)

key_a <-
  rep(c("green", "blue", "orange"), 1*10^7/2)

key_b <-
  rep(c("yellow", "purple", "red"), 1*10^7/2)

key_c <-
  rep(c("hazel", "pink", "lilac"), 1*10^7/2)

key_d <-
  rep(c("A", "B", "C"), 1*10^7/2)

key_e <-
  rep(c("D", "E", "F", "G", "H", "I"), 1*10^7/5)

key_f <-
  rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)

key_g <-
  rep(c("Z", "M", "Q", "T", "X", "B"), 1*10^7/5)

key_h <-
  rep(c("tree", "plant", "animal", "forest"), 1*10^7/3)

key_i <-
  rep(c("up", "up", "left", "left", "right", "right"), 1*10^7/5)

sequence <- 
  seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "1 day")

date_sequence <-
  rep(sequence, 1*10^7/(length(sequence) - 1))

dataframe <-
  data.frame(
    bigint,
    date = date_sequence[1:(1*10^7)],
    key_a = key_a[1:(1*10^7)],
    key_b = key_b[1:(1*10^7)],
    key_c = key_c[1:(1*10^7)],
    key_d = key_d[1:(1*10^7)],
    key_e = key_e[1:(1*10^7)],
    key_f = key_f[1:(1*10^7)],
    key_g = key_g[1:(1*10^7)],
    key_h = key_h[1:(1*10^7)],
    key_i = key_i[1:(1*10^7)]
  )

dataframe_two <-
  dataframe %>%
      mutate(date_sequence = ymd(date_sequence) + days(1))

sequence_sixdays <-
  seq(ymd("2010-01-01"), ymd("2020-01-01"), by = "6 days")

date_sequence <-
  rep(sequence_sixdays, 3*10^6/(length(sequence_sixdays) - 1))

key_z <-
  sample(1:10000000, 3*10^6)

dataframe_three <-
  data.frame(
    key_a = sample(key_a, 3*10^6),
    date = date_sequence[1:(3*10^6)],
    key_z = key_z[1:(3*10^6)]
  )

Recommended answer

What best practices can I apply and, in particular, what can I do to make these types of functions optimized for large datasets?

Use data.table

library(data.table)
d1 = as.data.table(dataframe)
d2 = as.data.table(dataframe_two)
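
As an aside, at 10 to 15 million rows the copy made by as.data.table() can itself be noticeable; data.table also offers setDT(), which converts a data.frame to a data.table in place. A minimal sketch of that alternative:

# Alternative: convert by reference, avoiding a full copy.
# After this, `dataframe` itself is a data.table.
setDT(dataframe)
setDT(dataframe_two)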


1.

Grouping by multiple columns is something data.table excels at.
See the barchart at the very bottom of the second plot for a comparison against dplyr, spark and others for exactly this kind of grouping:
https://h2oai.github.io/db-benchmark

by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a1 = d1[, .(min_date = min(date_sequence)), by=by_cols]

Note: I changed date to date_sequence, as I think you meant that to be the column name.
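
If the same nine-key grouping is run repeatedly, keying the table on those columns once can also help, since later grouped queries over those columns can reuse the sort order. A minimal sketch along those lines, reusing by_cols from above:

# Key the table once on the grouping columns (sorts by reference);
# grouped aggregations over keyed columns benefit from the existing order.
setkeyv(d1, by_cols)
a1_keyed = d1[, .(min_date = min(date_sequence)), keyby = by_cols]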

2.

It is unclear which fields you want to merge the tables on; dataframe_two does not have the specified fields, so the query is invalid.
Please clarify.
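
For reference, if dataframe_two really does carry the same nine key_* columns as dataframe, the left join from the question could be written against the data.table objects roughly as sketched below; the presence of those key columns is an assumption, not something the sample data guarantees:

# A sketch assuming both tables contain the nine key_* columns.
by_cols = paste("key", c("a","b","c","d","e","f","g","h","i"), sep="_")
a2 = merge(d1, d2, by = by_cols, all.x = TRUE)  # keeps every row of d1
# equivalent join-on form, also keeping all rows of d1:
# a2 = d2[d1, on = by_cols]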

3.

data.table has a very useful type of join called a rolling join, which does exactly what you need.

a3 = d2[d1, on=c("key_a","date_sequence"), roll="nearest"]
# Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
#   Join results in more than 2^31 rows (internal vecseq reached physical limit).
#   Very likely misspecified join. Check for duplicate key values in i each of which
#   join to the same group in x over and over again. If that's ok, try by=.EACHI to
#   run j for each group to avoid the large allocation. Otherwise, please search for
#   this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker
#   for advice.

It results in an error. The error is in fact very useful. On your real data it may work perfectly fine, as the reason behind the error (the cardinality of matching rows) may be related to the process of generating the sample data; it is very tricky to produce good dummy data for joining. If you are getting the same error on your real data, you may want to review the design of that query, because it attempts a row explosion by doing a many-to-many join, even after considering only a single date_sequence identity per match (taking roll into account). I don't see this kind of query being valid for that data (strictly speaking, for the cardinalities of the join fields). You may want to introduce a data quality checks layer in your workflow to ensure there are no duplicates on key_a and date_sequence combined.
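
A minimal sketch of such a check, assuming d2 carries both key_a and date_sequence and that each combination is meant to be unique:

# Count rows sharing the same key_a / date_sequence pair.
dups = d2[, .N, by = .(key_a, date_sequence)][N > 1L]
if (nrow(dups) > 0L) {
  print(dups)  # offending combinations
  # one option before the rolling join: keep a single row per pair
  d2_unique = unique(d2, by = c("key_a", "date_sequence"))
}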
