Are JuliaDB or DataFrame faster than plain Array?


Question

I wonder whether there is a performance difference between a plain Array and JuliaDB or DataFrame when doing calculations on huge data sets (large, but still fitting in memory).

I can use plain arrays and algorithms to do sorting, grouping, reducing, etc. So why do I need JuliaDB or DataFrame?

I kind of understand why Python needs Pandas: it translates slow Python into fast C. But why does Julia need JuliaDB or DataFrame, when Julia is already fast?

Recommended answer

This is potentially a broad topic. Let me highlight the features that are key in my opinion.

Compared with plain arrays, DataFrames.jl and JuliaDB.jl offer the following:

  1. They allow you to store columns of data with different types. You can do the same with arrays, but in general they then have to be arrays of Any, which are slower and use more memory than data columns with concrete types.
  2. You can access columns by name. However, this is a secondary feature: for example, NamedArrays.jl provides an array-like type with named dimensions.
  3. An additional benefit is that there is an ecosystem built on the fact that columns have names (e.g. joining two DataFrames, or building a GLM model with GLM.jl).
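The first point can be illustrated with Base Julia alone. The following is a minimal sketch, not how either package is implemented: the names `rows` and `table` are hypothetical, and a NamedTuple of vectors merely stands in for a table with named, concretely typed columns.

```julia
# Heterogeneous data in a plain matrix: the element type has to be Any,
# so every element is stored as a boxed value.
rows = Any[1 "a" 1.5;
           2 "b" 2.5;
           3 "c" 3.5]

# Column-oriented alternative: each named column keeps its concrete type.
table = (id = [1, 2, 3], label = ["a", "b", "c"], score = [1.5, 2.5, 3.5])

println(eltype(rows))          # element type of the matrix
println(eltype(table.score))   # element type of one column

# Both sums give the same value, but the second works on a Vector{Any}
# and must dispatch on the type of every element at run time.
println(sum(table.score))
println(sum(rows[:, 3]))

# Base.summarysize reports the memory reachable from each object,
# which lets you compare the footprint of the two column representations.
println(Base.summarysize(table.score))
println(Base.summarysize(rows[:, 3]))
```

On a large data set, timing the two `sum` calls (for example with BenchmarkTools.jl) is a reasonable way to see the cost of the Any representation for yourself.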

This type of storage (heterogeneous columns with names) is a representation of a table in a relational database.

The two packages differ from each other in the following ways:

  1. JuliaDB.jl supports distributed parallelism; normal use of DataFrames.jl assumes that the data fits into memory (you can work around this using SharedArray, but that is not part of the design), and if you want to parallelise computations you have to do it manually;
  2. JuliaDB.jl supports indexing, while DataFrames.jl currently does not;
  3. Column types are stable in JuliaDB.jl, while in DataFrames.jl they currently are not. The consequences are:
    • when using JuliaDB.jl, each time a new type of data structure is created, all functions applied to that type have to be recompiled (this can be ignored for large data sets, but can have a visible performance impact when working with many small heterogeneous data sets);
    • when using DataFrames.jl, in some situations you have to use special techniques that ensure type inference in order to achieve high performance (most notably barrier functions).
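The barrier-function technique mentioned in the last point can be sketched in Base Julia. This is only an illustration under stated assumptions: `columns`, `stable`, `sum_of_squares`, and `process` are hypothetical names, and the `Dict{Symbol,Any}` is a loose stand-in for a container whose column types are not visible to the compiler, not the actual internals of DataFrames.jl.

```julia
# Columns held without concrete type information: the compiler only
# knows that looking up a column returns a value of type Any.
columns = Dict{Symbol,Any}(
    :x => collect(1.0:1000.0),   # Vector{Float64}
    :y => collect(1:1000),       # Vector{Int}
)

# Kernel function: Julia compiles a specialized version for each
# concrete vector type it is called with, so the loop itself is fast.
function sum_of_squares(v::AbstractVector)
    total = zero(eltype(v))
    for x in v
        total += x * x
    end
    return total
end

# Outer function: the type of `col` cannot be inferred here, but the call
# to the kernel acts as a barrier. One dynamic dispatch happens at the
# call, and everything inside the kernel runs with known types.
function process(cols, name)
    col = cols[name]
    return sum_of_squares(col)
end

println(process(columns, :x))
println(process(columns, :y))

# Type-stable alternative: a NamedTuple carries the column types in its
# own type, so each new combination of columns is a new type for which
# functions are compiled again.
stable = (x = collect(1.0:1000.0), y = collect(1:1000))
println(process(stable, :x))
```

Writing the loop directly inside `process` would still work, but with the `Dict{Symbol,Any}` container every iteration would have to handle values of unknown type, which is the situation the barrier avoids.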
