Julia: efficient ways to vcat n arrays

Question

I have a data structure, loaded from JSON, that resembles the following:

json_in =
  [ Dict("customer" => "cust1", "transactions" => 1:10^6)
  , Dict("customer" => "cust2", "transactions" => 1:10^6)
  , Dict("customer" => "cust3", "transactions" => 1:10^6)]

I know of two methods to collapse the transactions into one array:

@time methodA = reduce(vcat,[cust["transactions"] for cust in json_in])
@time methodB = vcat(json_in[1]["transactions"],json_in[2]["transactions"],json_in[3]["transactions"])

However, methodA takes ~0.22 s versus ~0.02 s for methodB on my computer. I intend to perform this thousands of times, so a 10x speedup is a big deal.

methodB is not very robust, though: it is hard-coded for exactly 3 Dicts (customers), so even though it is performant, it doesn't generalise.

What is the most efficient way to concatenate arrays that are elements of an array of Dicts?

Recommended answer

As @Gnimuc states in his comment, you should not benchmark in global scope, and benchmarks are best done using BenchmarkTools.jl. Here are the timings done properly:

julia> methodA(json_in) = reduce(vcat,[cust["transactions"] for cust in json_in])
methodA (generic function with 1 method)

julia> methodB(json_in) = vcat(json_in[1]["transactions"],json_in[2]["transactions"],json_in[3]["transactions"])
methodB (generic function with 1 method)

#Gnimuc's syntax from his comment
julia> methodC(json_in) = mapreduce(x->x["transactions"], vcat, json_in)
methodC (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark methodA(json_in)
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  15
  --------------
  minimum time:     10.584 ms (3.10% GC)
  median time:      14.781 ms (32.02% GC)
  mean time:        15.112 ms (32.19% GC)
  maximum time:     69.341 ms (85.28% GC)
  --------------
  samples:          331
  evals/sample:     1

julia> @benchmark methodB(json_in)
BenchmarkTools.Trial:
  memory estimate:  22.89 MiB
  allocs estimate:  2
  --------------
  minimum time:     5.921 ms (5.92% GC)
  median time:      8.402 ms (32.48% GC)
  mean time:        8.701 ms (33.46% GC)
  maximum time:     69.268 ms (91.09% GC)
  --------------
  samples:          574
  evals/sample:     1

julia> @benchmark methodC(json_in)
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  12
  --------------
  minimum time:     10.599 ms (3.37% GC)
  median time:      14.843 ms (32.12% GC)
  mean time:        15.228 ms (32.24% GC)
  maximum time:     71.954 ms (85.95% GC)
  --------------
  samples:          328
  evals/sample:     1

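Method B is still about twice as fast (median 8.4 ms versus 14.8 ms). That is because it is more specialized: it is written for an array with exactly three elements and concatenates them in a single vcat call, allocating 22.89 MiB, whereas the reduce-based versions also allocate the intermediate two-array result, for 38.15 MiB in total.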
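For an arbitrary number of customers, one option is to compute the total length first and copy each array into a single preallocated vector, which keeps the single-allocation behaviour of method B. The following is a minimal sketch using only Base. `concat_transactions` is a hypothetical name, and the sketch assumes every "transactions" entry is a vector or range of `Int`:

```julia
# Hypothetical helper (not from the original answer): concatenates the
# "transactions" entries of any number of customers with one output allocation.
# Assumes every entry is an AbstractVector with Int elements.
function concat_transactions(json_in)
    parts = [cust["transactions"] for cust in json_in]
    total = mapreduce(length, +, parts; init = 0)  # total number of elements
    out = Vector{Int}(undef, total)                # single allocation for the result
    pos = 1
    for p in parts
        copyto!(out, pos, p, 1, length(p))  # copy p into out starting at index pos
        pos += length(p)
    end
    return out
end

# Small illustrative input with four customers
json_small = [Dict("customer" => "cust$i", "transactions" => 1:5) for i in 1:4]
result = concat_transactions(json_small)
```

Splatting, as in `vcat(parts...)`, is another way to generalise method B, though splatting a very large number of arrays can be costly to compile, so it is worth benchmarking both on real data.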

An alternative solution that might work well here is to use a MappedArray (from the MappedArrays package), which creates a lazy view into the original array:

using MappedArrays
method4(json_in) = mappedarray(x->x["transactions"], json_in)

Of course this doesn't concatenate the arrays, but you can concatenate views using the CatViews package:

julia> using CatViews
julia> method5(json_in) = reduce(CatView, mappedarray(x->x["transactions"], json_in))
method5 (generic function with 1 method)

julia> @benchmark method5(json_in)
BenchmarkTools.Trial:
  memory estimate:  1.73 KiB
  allocs estimate:  46
  --------------
  minimum time:     23.320 μs (0.00% GC)
  median time:      23.916 μs (0.00% GC)
  mean time:        25.466 μs (0.00% GC)
  maximum time:     179.092 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Because it allocates almost nothing (1.73 KiB) instead of copying the data, it is roughly 300x faster than method B. Using the result may be slower, however, because of nonlocality, so that is worth benchmarking too.
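If the result only needs to be iterated (for example to compute a sum) and indexing is not required, Base's `Iterators.flatten` gives a lazy concatenation without extra packages. This is a different technique from the CatView approach above, shown only as a sketch. `lazy_transactions` is a hypothetical name:

```julia
# Hypothetical helper: lazily chains all "transactions" entries.
# Nothing is copied; elements are produced on demand during iteration.
lazy_transactions(json_in) = Iterators.flatten(cust["transactions"] for cust in json_in)

# Small illustrative input with three customers
json_small = [Dict("customer" => "cust$i", "transactions" => 1:5) for i in 1:3]

# Iterates over all elements without building a 15-element array
total = sum(lazy_transactions(json_small))

# Copies the data only when an actual Array is needed
materialized = collect(lazy_transactions(json_small))
```

Unlike a CatView, the flattened iterator does not support indexing, so it only fits code that consumes the elements in order.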
